
    Principled Approaches to Last-Level Cache Management

    Memory is a critical component of all computing systems and represents a fundamental performance and energy bottleneck. Ideally, memory aspects such as energy cost, performance, and the cost of implementing management techniques would scale together with the size of all different computing systems; unfortunately, this is not the case. With the upcoming trends in applications, new memory technologies, and so on, scaling becomes an even bigger problem, aggravating the performance bottleneck that memory represents. The memory hierarchy was proposed to alleviate this problem. Each level in the hierarchy tends to have a lower cost per bit, a larger capacity, and a higher access time than the level above it. Preferably, all data would be stored in the fastest level of memory; unfortunately, faster memory technologies tend to come with a higher manufacturing cost, which often limits their capacity. The design challenge is to determine which data is frequently used and store it in the faster levels of memory. A cache is a small, fast, on-chip chunk of memory. Any data stored in main memory can also be stored in the cache. For many programs, a typical behavior is to access data that has been accessed previously. Taking advantage of this behavior, a copy of frequently accessed data is kept in the cache in order to provide a faster access time the next time it is requested. Due to capacity constraints, it is likely that all of the frequently reused data cannot fit in the cache; because of this, cache management policies decide which data is to be kept in the cache and which in other levels of the memory hierarchy. Under an efficient cache management policy, a large fraction of memory requests will be serviced from a fast on-chip cache. The disparity in access latency between the last-level cache and main memory motivates the search for efficient cache management policies. A great amount of recently proposed work strives to utilize cache capacity in the way most favorable to performance. Related work optimizes cache performance along different dimensions, e.g., reducing miss rate, consuming less power, reducing storage overhead, or reducing access latency. Our work focuses on improving the performance of last-level caches by designing policies based on principles adapted from other areas of interest. In this dissertation we focus on several aspects of cache management policies. We first introduce a space-efficient placement and promotion policy whose goal is to minimize the updates to replacement-policy state on each cache access. We further introduce a mechanism that predicts whether a block in the cache will be reused; it feeds different features of a block to the predictor in order to increase the correlation between a previous access and a future access. We finally introduce a technique that tweaks traditional cache indexing, providing fast access to the vast majority of requests in the presence of a slow-access memory technology such as DRAM.
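    The abstract does not specify which block features drive its reuse predictor, so the following is only a minimal sketch of the general idea under assumptions: a table of saturating counters, indexed by a hash of features such as the requesting PC and a few address bits, learns whether blocks inserted under a given feature value tend to be reused. All names and sizes here are illustrative, not the dissertation's design.

```python
class ReusePredictor:
    """Sketch of a feature-indexed reuse predictor for cache blocks."""

    def __init__(self, entries=4096, max_count=3):
        self.entries = entries
        self.max_count = max_count
        self.table = [max_count // 2] * entries   # start in a weakly "no reuse" state

    def _index(self, pc, addr):
        # Hypothetical feature hash: requesting PC combined with block-address bits.
        return (pc ^ (addr >> 6)) % self.entries

    def predict_reuse(self, pc, addr):
        # Predict "will be reused" when the counter is in its upper half.
        return self.table[self._index(pc, addr)] > self.max_count // 2

    def train(self, pc, addr, was_reused):
        # On a hit or an eviction, nudge the counter toward the observed outcome.
        i = self._index(pc, addr)
        if was_reused:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

    A replacement policy could, for example, insert blocks predicted "not reused" at low priority so they are evicted early, and protect blocks predicted "reused".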

    Ultra low power cooperative branch prediction

    Branch prediction is a key task in the operation of a high performance processor. An inaccurate branch predictor results in increased program run-time and a rise in energy consumption. The drive towards processors with limited die-space and tighter energy requirements will continue to intensify over the coming years, as will the shift towards increasingly multicore processors. Both trends make it increasingly important, and increasingly difficult, to find effective and efficient branch predictor designs. This thesis presents savings in energy and die-space achieved through novel, more efficient cooperative branch predictor designs. The first contribution is a new take on the problem of a hybrid dynamic-static branch predictor allocating branches to be predicted by one of its sub-predictors. A new bias parameter is introduced as a mechanism for trading off a small amount of performance for savings in die-space and energy. This is achieved by predicting more branches with the static predictor, ensuring that only the branches that will most benefit from the dynamic predictor's resources are predicted dynamically. This reduces pressure on the dynamic predictor's resources, allowing a smaller predictor to achieve very high accuracy. An improvement in run-time of 7-8% over the baseline BTFN predictor is observed at the cost of a branch predictor storage budget of much less than 1KB. Next, a novel approach to branch prediction for multicore data-parallel applications is presented. The Peloton branch prediction scheme uses a pack of cyclists as an illustration of how a group of processors running similar tasks can share branch predictions to improve accuracy and reduce run-time. The results show that sharing updates for conditional branches across the existing interconnect for I-cache and D-cache updates reduces mispredictions by up to 25% and run-time by up to 6%. McPAT is used to build an energy model suggesting that these savings come at little to no increase in energy. The technique is then extended to architectures where the size of the branch predictors may differ between cores. The results show that such heterogeneity can dramatically reduce the die-space required for an accurate branch predictor while having little impact on performance and delivering up to 9% energy savings. The approach can be combined with the Peloton branch prediction scheme for a reduction in branch mispredictions of up to 5%.
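    The thesis does not give the bias parameter's exact form, so the sketch below only illustrates the underlying idea with assumed names: a profiling run measures each branch's bias toward its dominant direction, strongly biased branches are predicted statically, and only the remaining hard branches consume dynamic predictor resources.

```python
def allocate_branches(profile, bias_threshold=0.95):
    """Split branches between a static and a dynamic predictor.

    profile maps branch PC -> (taken_count, not_taken_count) from a profiling run.
    Returns (static_predictions, dynamic_branches): static_predictions maps PC to
    the statically predicted direction (True = taken).
    """
    static_predictions, dynamic_branches = {}, []
    for pc, (taken, not_taken) in profile.items():
        total = taken + not_taken
        bias = max(taken, not_taken) / total if total else 1.0
        if bias >= bias_threshold:
            # Strongly biased branch: its dominant direction is good enough.
            static_predictions[pc] = taken >= not_taken
        else:
            # Hard branch: spend dynamic predictor capacity on it.
            dynamic_branches.append(pc)
    return static_predictions, dynamic_branches
```

    Lowering bias_threshold pushes more branches onto the static predictor, shrinking the dynamic predictor needed at the cost of a small amount of accuracy, which is the trade-off the bias parameter exposes.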

    Hybrid branch prediction for pipelined MIPS processor

    In modern microprocessors designed with pipeline stages, performance suffers when branch instructions are executed because they introduce stalls in the pipeline; this in turn increases the processor's cycles per instruction (CPI). When executing a branch instruction, the processor needs extra clock cycles to determine whether the branch will happen (taken) or not (not taken), and it must also calculate the target address when the branch is taken. Predicting whether a branch is taken or not taken is therefore an important step in enhancing processor performance. In this research, more than one method of branch prediction is used (a hybrid), and the designed circuit chooses among the different prediction algorithms depending on the type of the branch. Some of these methods are static while the others are dynamic. All circuits were built practically and examined by applying different programs to the designed predictor in order to evaluate the performance of the processor.
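    The abstract does not name the specific algorithms used in the hybrid, so the following is only a generic sketch of the kind of dynamic component such designs commonly include: a table of 2-bit saturating counters indexed by the branch address (sizes here are illustrative).

```python
class TwoBitPredictor:
    """Classic 2-bit saturating-counter branch predictor."""

    STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [self.WEAK_NT] * entries

    def predict(self, pc):
        # Predict taken when the counter is in one of the two "taken" states.
        return self.counters[pc % self.entries] >= self.WEAK_T

    def update(self, pc, taken):
        # Saturating update: a strongly biased entry needs two consecutive
        # mispredictions to flip, which tolerates anomalies such as loop exits.
        i = pc % self.entries
        if taken:
            self.counters[i] = min(self.STRONG_T, self.counters[i] + 1)
        else:
            self.counters[i] = max(self.STRONG_NT, self.counters[i] - 1)
```

    A hybrid design would pair something like this with one or more static rules (e.g. backward-taken/forward-not-taken) and select between them per branch.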

    Reducing complexity of processor front ends with static analysis and selective preloading

    General purpose processors were once designed with the major goal of maximizing performance. As power consumption has grown, with the advent of multi-core processors and the rising importance of embedded and mobile devices, the importance of designing efficient and low cost architectures has increased. This dissertation focuses on reducing the complexity of the front end of the processor, mainly branch predictors. Branch predictors, too, have been designed with a focus on improving prediction accuracy so that performance is maximized. To accomplish this, the predictors proposed in the literature and used in real systems have become increasingly large and complex, a trend that is inconsistent with the anticipated trend of simpler and more numerous cores in future processors. Much of the increased complexity in many recently proposed predictors is used to select the part of history most correlated to a branch. This makes them costly, if not impossible, to implement practically. We suggest that these complex decisions do not have to be made in hardware at prediction or run time and can be moved offline: high accuracy can be achieved by making complex prediction decisions in a one-time profile run instead of using complex hardware. We apply these techniques to Spotlight, our own low cost, low complexity branch predictor. A static analysis step determines, for each branch, the history segment yielding the highest accuracy, and this information is placed in unused instruction space. Spotlight achieves higher accuracy than other implementation-simple predictors such as Gshare and YAGS and matches or outperforms the two complex neural predictors that we compare it to. To ensure timely access, we evaluate using a hardware table (called a BIT) to store the profile bits after they are extracted from instructions, and we evaluate the accuracy of using this table. The drawback of a BIT is its size. We introduce a novel technique, Preloading, which places data for an instruction in prior blocks on the path to the instruction. By doing so, it is able to significantly reduce the size of the BIT needed for good performance. We also discuss applications of Preloading to front-end structures other than branch predictors.
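    The dissertation's actual static-analysis pass is not detailed in the abstract, so the following is only an illustrative sketch of the offline idea with assumed data structures: for each static branch, evaluate candidate global-history segments on a recorded trace and keep the (start, length) pair that predicts that branch best; the chosen segment would then be encoded into unused instruction space.

```python
from collections import defaultdict

def best_history_segment(trace, history_bits=32, max_len=8):
    """trace: list of (pc, taken, global_history) tuples from a profiling run.

    Returns a dict mapping branch PC -> (start_bit, length) of the history
    segment that predicted that branch most accurately in the profile.
    """
    per_branch = defaultdict(list)
    for pc, taken, hist in trace:
        per_branch[pc].append((taken, hist))

    choice = {}
    for pc, samples in per_branch.items():
        best_acc, best_seg = -1.0, (0, 1)
        for start in range(history_bits):
            for length in range(1, max_len + 1):
                if start + length > history_bits:
                    break
                # A majority vote per segment value stands in for the small
                # counter table the hardware predictor would use.
                votes = defaultdict(lambda: [0, 0])
                for taken, hist in samples:
                    seg = (hist >> start) & ((1 << length) - 1)
                    votes[seg][1 if taken else 0] += 1
                acc = sum(max(v) for v in votes.values()) / len(samples)
                if acc > best_acc:
                    best_acc, best_seg = acc, (start, length)
        choice[pc] = best_seg
    return choice
```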

    Applying Perceptrons to Speculation in Computer Architecture

    Speculation plays an ever-increasing role in optimizing the execution of programs in computer architecture. Speculative decision-makers are typically required to have high speed and small size, thus limiting their complexity and capability. Because of these restrictions, predictors often consider only a small subset of the available data in making decisions, and consequently do not realize their potential accuracy. Perceptrons, or simple neural networks, can be highly useful in speculation for their ability to examine larger quantities of available data and identify which data lead to accurate results. Recent research has demonstrated that perceptrons can operate successfully within the strict size and latency restrictions of speculation in computer architecture. This dissertation first studies how perceptrons can be made to predict accurately when they directly replace the traditional pattern table predictor. Several weight training methods and multiple-bit perceptron topologies are modeled and evaluated in their ability to learn data patterns that pattern tables can learn. The effects of interference between past data on perceptrons are evaluated, and different interference reduction strategies are explored. Perceptrons are then applied to two speculative applications: data value prediction and dataflow critical path prediction. Several new perceptron value predictors are proposed that can consider longer or more varied data histories than existing table-based value predictors. These include a global-based local predictor that uses global correlations between data values to predict past local values, a global-based global predictor that uses global correlations to predict past global values, and a bitwise predictor that can use global correlations to generate new data values. Several new perceptron criticality predictors are proposed that use global correlations between instruction behaviors to accurately determine whether instructions lie on the critical path. These predictors are evaluated against local table-based approaches on a custom cycle-accurate processor simulator, and are shown on average to have both superior accuracy and higher instructions-per-cycle performance. Finally, the perceptron predictors are simulated using the different weight training approaches and multiple-bit topologies. It is shown that for these applications, perceptron topologies and training approaches must be selected that respond well to highly imbalanced and poorly correlated past data patterns.
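    The sketch below shows the standard perceptron prediction and threshold-training rule (in the style of the well-known perceptron branch predictor) that this line of work builds on; the table size, history length, and feature encoding here are illustrative assumptions rather than the dissertation's configurations.

```python
class PerceptronPredictor:
    """Perceptron predictor over a +1/-1 history vector (illustrative sizes)."""

    def __init__(self, entries=256, history_len=16):
        self.entries = entries
        self.history_len = history_len
        self.theta = int(1.93 * history_len + 14)     # common training threshold
        # weights[i][0] is the bias weight; the rest correlate with history bits.
        self.weights = [[0] * (history_len + 1) for _ in range(entries)]

    def _dot(self, w, history):
        # history is a list of +1/-1 past outcomes (or other feature bits).
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))

    def predict(self, pc, history):
        return self._dot(self.weights[pc % self.entries], history) >= 0

    def train(self, pc, history, outcome):
        w = self.weights[pc % self.entries]
        y = self._dot(w, history)
        t = 1 if outcome else -1
        # Train only on a misprediction or when confidence is below the threshold.
        if (y >= 0) != outcome or abs(y) <= self.theta:
            w[0] += t
            for i, hi in enumerate(history):
                w[i + 1] += t * hi
```

    Replacing the branch-outcome history with value or criticality features is, at a high level, how the same machinery is carried over to the predictors the dissertation proposes.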

    Design of a distributed memory unit for clustered microarchitectures

    Power constraints led to the end of exponential growth in single-processor performance, which characterized the semiconductor industry for many years. Single-chip multiprocessors have allowed performance growth to continue so far. Yet Amdahl's law asserts that the overall performance of future single-chip multiprocessors will depend crucially on single-processor performance. In a multiprocessor, a small growth in single-processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy efficiency and ultimately the performance of a single processor. In a clustered microarchitecture, parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the clusters' components. Because the clusters together process a single instruction stream, communication between clusters is necessary and introduces an additional cost. This thesis proposes the design of a distributed memory unit and first-level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied, the distribution of the memory unit and the cache has received comparatively little attention. The first proposal consists of a set of cache bank predictors. Eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit. The load and store queues are split into smaller queues for distributed disambiguation. The mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data's origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues. These mechanisms avoid load/store queue overflows that result from the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add to the memory unit the functionality to select instructions for execution and re-execution. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation, a deadlock-safe issue policy for the Memory Issue Queues; deadlocks can result from certain queue allocations because entries are allocated out of order instead of in order as in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries. Architectures with weak memory ordering, such as Alpha, PowerPC or ARMv7, can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy-efficient load queues without the need for energy-hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and improved performance.
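    The eight bank predictor designs the thesis compares are not detailed in the abstract; the sketch below is only an assumed, simple example of the idea: a PC-indexed table of per-bank saturating counters that guesses which cache bank a memory instruction will access before its address has been calculated.

```python
class BankPredictor:
    """Illustrative PC-indexed cache bank predictor (sizes are assumptions)."""

    def __init__(self, entries=1024, num_banks=4, max_count=15):
        self.entries = entries
        self.num_banks = num_banks
        self.max_count = max_count
        # One small counter per (table entry, bank).
        self.table = [[0] * num_banks for _ in range(entries)]

    def predict(self, pc):
        # Predict the bank this instruction has accessed most often in the past.
        counts = self.table[pc % self.entries]
        return counts.index(max(counts))

    def update(self, pc, actual_bank):
        # Reinforce the observed bank and decay the others so the predictor
        # can follow changes in the access pattern.
        counts = self.table[pc % self.entries]
        counts[actual_bank] = min(self.max_count, counts[actual_bank] + 1)
        for b in range(self.num_banks):
            if b != actual_bank and counts[b] > 0:
                counts[b] -= 1
```

    In an organization like the one described above, a correct prediction lets instructions that consume the data be mapped close to the bank that holds it, while a misprediction would incur extra inter-cluster communication.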

    Summarizing multiprocessor program execution with versatile, microarchitecture-independent snapshots

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 131-137).
    Computer architects rely heavily on software simulation to evaluate, refine, and validate new designs before they are implemented. However, simulation time continues to increase as computers become more complex and multicore designs become more common. This thesis investigates software structures and algorithms for quickly simulating modern cache-coherent multiprocessors by amortizing the time spent to simulate the memory system and branch predictors. The Memory Timestamp Record (MTR) summarizes the directory and cache state of a multiprocessor system in a compact data structure. A single MTR snapshot is versatile enough to reconstruct the microarchitectural state resulting from various coherence protocols and cache organizations. The MTR may be quickly updated by each simulated processor during a fast-forwarding phase and optionally stored off-line for reuse. To fill large branch prediction tables, we introduce Branch Predictor-based Compression (BPC), which compactly stores a branch trace so that it may be used to fill in any branch predictor structure. An entire BPC trace requires less space than single discrete predictor snapshots, and it may be decompressed 3-6x faster than performing functional simulation.
    by Kenneth C. Barr, Ph.D.
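    The abstract describes the MTR only at a high level, so the following is a minimal sketch of the timestamp idea under assumed structures: during fast-forwarding, record a logical timestamp for each processor's last access to each block; later, reconstruct what an LRU set-associative cache of any chosen geometry would contain by keeping the most recently touched blocks in each set. Coherence-state reconstruction and the actual MTR layout are beyond this sketch.

```python
from collections import defaultdict

class MemoryTimestampRecord:
    """Illustrative per-processor last-access timestamps (not the real MTR layout)."""

    def __init__(self, block_bits=6):
        self.block_bits = block_bits
        self.clock = 0
        # last_access[cpu][block] = logical time of that cpu's last access
        self.last_access = defaultdict(dict)

    def record(self, cpu, addr):
        self.clock += 1
        self.last_access[cpu][addr >> self.block_bits] = self.clock

    def reconstruct_cache(self, cpu, num_sets, ways):
        """Return {set_index: [blocks]} an LRU cache of this geometry would hold."""
        sets = defaultdict(list)
        for block, ts in self.last_access[cpu].items():
            sets[block % num_sets].append((ts, block))
        contents = {}
        for idx, blocks in sets.items():
            blocks.sort(reverse=True)                  # most recently used first
            contents[idx] = [b for _, b in blocks[:ways]]
        return contents
```

    The same recorded snapshot can be replayed against different num_sets and ways values, which is the sense in which one snapshot serves many cache organizations.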

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Processor efficiency can be described with the help of a number of desirable effects or metrics, for example, performance, power, area, design complexity and access latency. These metrics serve as valuable tools in designing new processors, and they also act as effective standards for comparing current processors. Various factors impact the efficiency of modern out-of-order processors, and one important factor is the manner in which instructions are processed through the processor pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency, and show how to improve efficiency by exploiting common-case or predictable patterns in the behavior of memory instructions. The memory behavior patterns we focus on are the predictability of memory dependences, the predictability of data forwarding patterns, the predictability of instruction criticality, and the conservativeness of resource allocation and deallocation policies. We first design a scalable and high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry hardware in the processor with no loss in performance. We then turn to instruction criticality: we study the behavior of critical load instructions and propose applications that can be optimized using predictable load-criticality information. Finally, we examine conventional techniques for allocation and deallocation of the critical structures that process memory instructions and propose new techniques to optimize them. Our new designs have the potential to significantly reduce the power and area required by processors without losing performance, leading to more efficient processor designs.
    Ph.D. Committee Chair: Loh, Gabriel H.; Committee Members: Clark, Nathan; Jaleel, Aamer; Kim, Hyesoon; Lee, Hsien-Hsin S.; Prvulovic, Milo.
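    The dissertation's memory dependence predictor is not described in the abstract; the sketch below only illustrates the general class of mechanism (similar in spirit to store sets, with all names assumed): loads and stores that have caused memory-order violations in the past are grouped into the same dependence set, and a load is delayed while an unresolved older store from its set is in flight.

```python
class DependencePredictor:
    """Illustrative store-set-style memory dependence predictor."""

    def __init__(self):
        self.set_of = {}       # instruction PC -> dependence-set id
        self.next_set = 0

    def on_violation(self, load_pc, store_pc):
        # A load executed before an older store to the same address: remember the pair.
        sid = self.set_of.get(store_pc, self.set_of.get(load_pc))
        if sid is None:
            sid = self.next_set
            self.next_set += 1
        self.set_of[load_pc] = sid
        self.set_of[store_pc] = sid

    def must_wait(self, load_pc, inflight_store_pcs):
        # Delay the load if any unresolved older store shares its dependence set;
        # loads with no learned dependences speculate freely.
        sid = self.set_of.get(load_pc)
        return sid is not None and any(
            self.set_of.get(pc) == sid for pc in inflight_store_pcs)
```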