96 research outputs found

    Software trace cache

    Get PDF
    We explore the use of compiler optimizations, which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance; the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout optimized, codes have some special characteristics that make them more amenable for high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.Peer ReviewedPostprint (published version

    Ultra low power cooperative branch prediction

    Get PDF
    Branch Prediction is a key task in the operation of a high performance processor. An inaccurate branch predictor results in increased program run-time and a rise in energy consumption. The drive towards processors with limited die-space and tighter energy requirements will continue to intensify over the coming years, as will the shift towards increasingly multicore processors. Both trends make it increasingly important and increasingly difficult to find effective and efficient branch predictor designs. This thesis presents savings in energy and die-space through the use of more efficient cooperative branch predictors achieved through novel branch prediction designs. The first contribution is a new take on the problem of a hybrid dynamic-static branch predictor allocating branches to be predicted by one of its sub-predictors. A new bias parameter is introduced as a mechanism for trading off a small amount of performance for savings in die-space and energy. This is achieved by predicting more branches with the static predictor, ensuring that only the branches that will most benefit from the dynamic predictor’s resources are predicted dynamically. This reduces pressure on the dynamic predictor’s resources allowing for a smaller predictor to achieve very high accuracy. An improvement in run-time of 7-8% over the baseline BTFN predictor is observed at a cost of a branch predictor bits budget of much less than 1KB. Next, a novel approach to branch prediction for multicore data-parallel applications is presented. The Peloton branch prediction scheme uses a pack of cyclists as an illustration of how a group of processors running similar tasks can share branch predictions to improve accuracy and reduce runtime. The results show that sharing updates for conditional branches across the existing interconnect for I-cache and D-cache updates results in a reduction of mispredictions of up to 25% and a reduction in run-time of up to 6%. McPAT is used to present an energy model that suggests the savings are achieved at little to no increase in energy required. The technique is then extended to architectures where the size of the branch predictors may differ between cores. The results show that such heterogeneity can dramatically reduce the die-space required for an accurate branch predictor while having little impact on performance and up to 9% energy savings. The approach can be combined with the Peloton branch prediction scheme for reduction in branch mispredictions of up to 5%

    Exploiting intra-function correlation with the global history stack

    Get PDF
    Abstract. The demand for more computation power in high-end embedded systems has put embedded processors on parallel evolution track as the RISC processors. Caches and deeper pipelines are standard features on recent embedded microprocessors. As a result of this, some of the performance penalties associated with branch instructions in RISC processors are becoming more prevalent in these processors. As is the case in RISC architectures, designers have turned to dynamic branch prediction to alleviate this problem. Global correlating branch predictors take advantage of the influence past branches have on future ones. The conditional branch outcomes are recorded in a global history register (GHR). Based on the hypothesis that most correlation is among intra-function branches, we provide a detailed analysis of the Global History Stack (GHS) in this paper. The GHS saves the global history in the return address stack when a call instruction is executed. Following the subsequent return, the history is restored from the stack. In addition, to preserve the correlation between the callee branches and the caller branches following the call instruction, we save a few of the history bits coming from the end of the callee's execution. We also investigate saving the GHR of a function in the Branch Target Buffer (BTB) when it returns so that it can be restored when that function is called again. Our results show that these techniques improve the accuracy of several global history based prediction schemes by 4% on average. Consequently, performance improvements as high as 13% are attained

    A case for (partially) tagged geometric history length branch prediction

    Get PDF
    International audienceIt is now widely admitted that in order to provide state-of-the-art accuracy, a conditional branch predictor must combine several predictions. Recent research has shown that an adder tree is a very effective approach for the prediction combination function. In this paper, we present a more cost effective solution for this prediction combination function for predictors relying on several predictor components indexed with different history lengths. Using geometric history length as the O-GEHL predictor, the TAGE predictor uses (partially) tagged components as the PPM-like predictor. TAGE relies on (partial) hit-miss detection as the prediction computation function. TAGE provides state-of-the-art prediction accuracy on conditional branches. In particular, at equivalent storage budgets, the TAGE predictor significantly outperforms all the predictors that were presented at the Championship Branch Prediction in december 2004. The accuracy of the prediction of the targets of indirect branches is a major issue on some applications. We show that the principles of the TAGE predictor can be directly applied to the prediction of indirect branches. The ITTAGE predictor (Indirect Target TAgged GEometric history length) significantly outperforms previous state-of-the-art indirect target branch predictors. Both TAGE and ITTAGE predictors feature tagged predictor components indexed with distinct history lengths forming a geometric series. They can be associated in a single cost-effective predictor, sharing tables and predictor logic, the COTTAGE predictor (COnditional and indirect Target TAgged GEometric history length)

    Improving the O-GEHL branch prediction accuracy using analytical results

    Get PDF
    The O-GEHL branch predictor has outperformed other prediction schemes using the same set of benchmarks in an international branch prediction contest, CBP-1. In this paper, we present the analysis results on each of the OGEHL branch predictor tables and also on the optimal number of predictor tables. Two methods are subsequently proposed to help increase the O-GEHL prediction accuracy. The first one aims to increase the space utilization of the first predictor table by dynamically adjusting the lengths of branch history regarding to the type of a benchmark currently in execution. The second one adds an extra table into the O-GEHL predictor using the space saved from the sharing of hysteresis bits. Experimental results have confirmed that both schemes improve the accuracy of two different predictor configurations, leading to two promising research directions for future explorations.Facultad de Informátic

    Principled Approaches to Last-Level Cache Management

    Get PDF
    Memory is a critical component of all computing systems. It represents a fundamental performance and energy bottleneck. Ideally, memory aspects such as energy cost, performance, and the cost of implementing management techniques would scale together with the size of all different computing systems; unfortunately this is not the case. With the upcoming trends in applications, new memory technologies, etc., scaling becomes a bigger a problem, aggravating the performance bottleneck that memory represents. A memory hierarchy was proposed to alleviate the problem. Each level in the hierarchy tends to have a decreasing cost per bit, an increased capacity, and a higher access time compared to its previous level. Preferably all data will be stored in the fastest level of memory, unfortunately, faster memory technologies tend to be associated with a higher manufacturing cost, which often limits their capacity. The design challenge is, to determine which is the frequently used data, and store it in the faster levels of memory. A cache is a small, fast, on-chip chunk of memory. Any data stored in main memory can be stored in the cache. For many programs, a typical behavior is to access data that has been accessed previously. Taking advantage of this behavior, a copy of frequently accessed data is kept in the cache, in order to provide a faster access time next time is requested. Due to capacity constrains, it is likely that all of the frequently reused data cannot fit in the cache, because of this, cache management policies decide which data is to be kept in the cache, and which in other levels of the memory hierarchy. Under an efficient cache management policy, an encouraging amount of memory requests will be serviced from a fast on-chip cache. The disparity in access latency between the last-level cache and main memory motivates the search for efficient cache management policies. There is a great amount of recently proposed work that strives to utilize cache capacity in the most favorable to performance way possible. Related work focus on optimizing the performance of caches focusing on different possible solutions, e.g. reduce miss rate, consume less power, reducing storage overhead, reduce access latency, etc. Our work focus on improving the performance of last-level caches by designing policies based on principles adapted from other areas of interest. In this dissertation, we focus on several aspects of cache management policies, we first introduce a space-efficient placement and promotion policy which goal is to minimize the updates to the replacement policy state on each cache access. We further introduce a mechanism that predicts whether a block in the cache will be reused, it feeds different features from a block to the predictor in order to increase the correlation of a previous access to a future access. We later introduce a technique that tweaks traditional cache indexing, providing fast accesses to a vast majority of requests in the presence of a slow access memory technology such as DRAM

    Herramienta para la simulación y visualización de procesadores superscalares

    Get PDF
    Nuestro proyecto ha consistido en la elaboración de predictores de saltos para usar en el simulador Simplescalar. Simplescalar es una potente herramienta que permite la simulación de un procesador superescalar, desde distintos puntos de vista. Uno de estos puntos es la predicción de saltos, algo fundamental para el buen funcionamiento y rendimiento de un microprocesador. El simulador se encuentra dividido en varios módulos, cada uno de ellos tiene el código abierto, por lo que se permite su modificación para así poder asegurar a los investigadores que puedan probar con comodidad aquello en lo que estén interesados. El lenguaje de este código es C, y está estructurado y modulado de tal forma que se permiten hacer cambios con relativa facilidad. [ABSTRACT]Our project deals with the elaboration of branch predictors to use them with the Simpescalar simulator. Simplescalar is a powerful tool that allows a superscalar processor’s simulation from several viewpoints. One of these points is branch prediction, which is basic to have a good work and performance in a microprocessor. The simulator is divided into several modules, each one of them with their open code. That is the reason why it’s permitted its modification in order to make sure to investigators a comfortable testing of the topics they are interested in. The language of this code is C, and it is structured and moduled in such form that allows to make changes with relative easiness

    Design of a distributed memory unit for clustered microarchitectures

    Get PDF
    Power constraints led to the end of exponential growth in single–processor performance, which characterized the semiconductor industry for many years. Single–chip multiprocessors allowed the performance growth to continue so far. Yet, Amdahl’s law asserts that the overall performance of future single–chip multiprocessors will depend crucially on single–processor performance. In a multiprocessor a small growth in single–processor performance can justify the use of significant resources. Partitioning the layout of critical components can improve the energy–efficiency and ultimately the performance of a single processor. In a clustered microarchitecture parts of these components form clusters. Instructions are processed locally in the clusters and benefit from the smaller size and complexity of the clusters components. Because the clusters together process a single instruction stream communications between clusters are necessary and introduce an additional cost. This thesis proposes the design of a distributed memory unit and first level cache in the context of a clustered microarchitecture. While the partitioning of other parts of the microarchitecture has been well studied the distribution of the memory unit and the cache has received comparatively little attention. The first proposal consists of a set of cache bank predictors. Eight different predictor designs are compared based on cost and accuracy. The second proposal is the distributed memory unit. The load and store queues are split into smaller queues for distributed disambiguation. The mapping of memory instructions to cache banks is delayed until addresses have been calculated. We show how disambiguation can be implemented efficiently with unordered queues. A bank predictor is used to map instructions that consume memory data near the data origin. We show that this organization significantly reduces both energy usage and latency. The third proposal introduces Dispatch Throttling and Pre-Access Queues. These mechanisms avoid load/store queue overflows that are a result of the late allocation of entries. The fourth proposal introduces Memory Issue Queues, which add functionality to select instructions for execution and re-execution to the memory unit. The fifth proposal introduces Conservative Deadlock Aware Entry Allocation. This mechanism is a deadlock safe issue policy for the Memory Issue Queues. Deadlocks can result from certain queue allocations because entries are allocated out-of-order instead of in-order like in traditional architectures. The sixth proposal is the Early Release of Load Queue Entries. Architectures with weak memory ordering such as Alpha, PowerPC or ARMv7 can take advantage of this mechanism to release load queue entries before the commit stage. Together, these proposals allow significantly smaller and more energy efficient load queues without the need of energy hungry recovery mechanisms and without performance penalties. Finally, we present a detailed study that compares the proposed distributed memory unit to a centralized memory unit and confirms its advantages of reduced energy usage and of improved performance

    Reducing complexity of processor front ends with static analysis and selective preloading

    Get PDF
    General purpose processors were once designed with the major goal of maximizing performance. As power consumption has grown, with the advent of multi-core processors and the rising importance of embedded and mobile devices, the importance of designing efficient and low cost architectures has increased. This dissertation focuses on reducing the complexity of the front end of the processor, mainly branch predictors. Branch predictors have also been designed with a focus on improving prediction accuracy so that performance is maximized. To accomplish this, the predictors proposed in the literature and used in real systems have become increasingly complex and larger, a trend that is inconsistent with the anticipated trend of simpler and more numerous cores in future processors. Much of the increased complexity in many recently proposed predictors is used to select a part of history most correlated to a branch. This makes them costly, if not impossible to implement practically. We suggest that the complex decisions do not have to be made in hardware at prediction or run time and can be moved offline. High accuracy can be achieved by making complex prediction decisions in a one-time profile run instead of using complex hardware. We apply these techniques to Spotlight, our own low cost, low complexity branch predictor. A static analysis step determines, for each branch, the history segment yielding the highest accuracy. This information is placed in unused instruction space. Spotlight achieves higher accuracy than other implementation-simple predictors such as Gshare and YAGS and matches or outperforms the two complex neural predictors that we compare it to. To ensure timely access, we evaluate using a hardware table (called a BIT) to store profile bits after they are extracted from instructions, and the accuracy of using this table. The drawback of a BIT is its size. We introduce a novel technique, Preloading that places data for an instruction in prior blocks on the path to the instruction. By doing so, it is able to significantly reduce the size of the BIT needed for good performance. We discuss other applications of Preloading on the front end other than branch predictors

    Dynamic Data Dependence Tracking and Its Application to Branch Prediction

    Get PDF
    To continue to improve processor performance, microarchitects seek to increase the effective instruction level parallelism (ILP) that can be exploited in applications. A fundamental limit to improving ILP is data dependences among instructions. If data dependence information is available at run-time, there are many uses to improve ILP. Prior published examples include decoupled branch exectuion architectures and critical instruction detection. In this paper, we describe an efficient hardware mechanism to dynamically track the data dependence chains of the instructions in the pipeline. This information is available on a cycle-by-cycle basis to the microengine for optimizing its perfromance. We then use this design in a new value-based branch prediction design using Available Register Value Information (ARVI). From the use of data dependence information, the ARVI branch predictor has better prediction accuracy over a comparably sized hybrid branch perdictor. With ARVI used as the second-level branch predictor, the improved prediction accuracy results in a 12.6% performance improvement on average across the SPEC95 integer benchmark suite
    • …
    corecore