230 research outputs found

    The weakening of branch predictor performance as an inevitable side effect of exploiting control independence

    Get PDF
    Many algorithms are inherently sequential and hard to explicitly parallelize. Cores designed to aggressively handle these problems exhibit deeper pipelines and wider fetch widths to exploit instruction-level parallelism via out-of-order execution. As these parameters increase, so does the amount of instructions fetched along an incorrect path when a branch is mispredicted. Many of the instructions squashed after a branch are control independent, meaning they will be fetched regardless of whether the candidate branch is taken or not. There has been much research in retaining these control independent instructions on misprediction of the candidate branch. This research shows that there is potential for exploiting control independence since under favorable circumstances many benchmarks can exhibit 30% or more speedup. Though these control independent processors are meant to lessen the damage of misprediction, an inherent side-effect of fetching out of order, branch weakening, keeps realized speedup from reaching its potential. This thesis introduces, formally defines, and identifies the types of branch weakening. Useful information is provided to develop techniques that may reduce weakening. A classification is provided that measures each type of weakening to help better determine potential speedup of control independence processors. Experimentation shows that certain applications suffer greatly from weakening. Total branch mispredictions increase by 30% in several cases. Analysis has revealed two broad causes of weakening: changes in branch predictor update times and changes in the outcome history used by branch predictors. Each of these broad causes are classified into more specific causes, one of which is due to the loss of nearby correlation data and cannot be avoided. The classification technique presented in this study measures that 45% of the weakening in the selected SPEC CPU 2000 benchmarks are of this type while 40% involve other changes in outcome history. The remaining 15% is caused by changes in predictor update times. In applying fundamental techniques that reduce weakening, the Control Independence Aware Branch Predictor is developed. This predictor reduces weakening for the majority of chosen benchmarks. In doing so, a control independence processor, snipper, to attain significantly higher speedup for 10 out of 15 studied benchmarks

    Controlling speculative execution through a virtually ordered memory system

    Get PDF
    Processors which extract parallelism through speculative execution must be able to identify when mis-speculation has occurred. The three places where mis-speculation can occur are register accesses, control flow prediction and memory accesses. Controlling register and control flow speculation has been well studied, but no scalable techniques for identifying memory dependence violations have been identified. Since speculative execution occurs out of order this requires tracking the causal order, as well as the addresses of memory accesses. This thesis uses simulations to investigate tracking the causal order of memory accesses using explicit tags known as virtual timestamps, a distributed and scalable method. Realizable virtual timestamps are necessarily restricted in length and it is demonstrated that naive allocation schemes seriously constrain execution by inefficiently allocating virtual timestamps. Efficiently allocating virtual timestamps requires analysis of the number required by each section of code. Basic statically and dynamically evaluated analysis methods are established to avoid virtual timestamp allocation becoming a resource bottleneck. The same analysis is also used to efficiently allocate state saving resources in a fixed hardware order. The hardware order provides an alternative way of maintaining the causal order using a simple hardware organization. The ability to predict the resources required by regions of code is used as a way of selecting instructions to execute speculatively. This enables resources to be allocated efficiently and is shown to allow large amounts of parallelism to be extracted. It also promotes the effectiveness of speculative execution by issuing less instructions that will ultimately be rolled back. Using a hierarchy of hardware ordering modules, themselves ordered by explicit virtual timestamps, a scalable ordering system is proposed. This hierarchy forms the basis of a twisted memory system, a multiple version memory system capable of identifying speculative memory dependence violations. The preliminary investigations presented here show that twisted memory has the potential to support aggressive speculative parallel execution. Particular attention is paid to memory bandwidth requirements

    Design of trace caches for high bandwidth instruction fetching

    Get PDF
    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (leaves 60-63).by Michael Sung.M.Eng

    Code optimizations for narrow bitwidth architectures

    Get PDF
    This thesis takes a HW/SW collaborative approach to tackle the problem of computational inefficiency in a holistic manner. The hardware is redesigned by restraining the datapath to merely 16-bit datawidth (integer datapath only) to provide an extremely simple, low-cost, low-complexity execution core which is best at executing the most common case efficiently. This redesign, referred to as the Narrow Bitwidth Architecture, is unique in that although the datapath is squeezed to 16-bits, it continues to offer the advantage of higher memory addressability like the contemporary wider datapath architectures. Its interface to the outside (software) world is termed as the Narrow ISA. The software is responsible for efficiently mapping the current stack of 64-bit applications onto the 16-bit hardware. However, this HW/SW approach introduces a non-negligible penalty both in dynamic code-size and performance-impact even with a reasonably smart code-translator that maps the 64- bit applications on to the 16-bit processor. The goal of this thesis is to design a software layer that harnesses the power of compiler optimizations to assuage this negative performance penalty of the Narrow ISA. More specifically, this thesis focuses on compiler optimizations targeting the problem of how to compile a 64-bit program to a 16-bit datapath machine from the perspective of Minimum Required Computations (MRC). Given a program, the notion of MRC aims to infer how much computation is really required to generate the same (correct) output as the original program. Approaching perfect MRC is an intrinsically ambitious goal and it requires oracle predictions of program behavior. Towards this end, the thesis proposes three heuristic-based optimizations to closely infer the MRC. The perspective of MRC unfolds into a definition of productiveness - if a computation does not alter the storage location, it is non-productive and hence, not necessary to be performed. In this research, the definition of productiveness has been applied to different granularities of the data-flow as well as control-flow of the programs. Three profile-based, code optimization techniques have been proposed : 1. Global Productiveness Propagation (GPP) which applies the concept of productiveness at the granularity of a function. 2. Local Productiveness Pruning (LPP) applies the same concept but at a much finer granularity of a single instruction. 3. Minimal Branch Computation (MBC) is an profile-based, code-reordering optimization technique which applies the principles of MRC for conditional branches. The primary aim of all these techniques is to reduce the dynamic code footprint of the Narrow ISA. The first two optimizations (GPP and LPP) perform the task of speculatively pruning the non-productive (useless) computations using profiles. Further, these two optimization techniques perform backward traversal of the optimization regions to embed checks into the nonspeculative slices, hence, making them self-sufficient to detect mis-speculation dynamically. The MBC optimization is a use case of a broader concept of a lazy computation model. The idea behind MBC is to reorder the backslices containing narrow computations such that the minimal necessary computations to generate the same (correct) output are performed in the most-frequent case; the rest of the computations are performed only when necessary. With the proposed optimizations, it can be concluded that there do exist ways to smartly compile a 64-bit application to a 16- bit ISA such that the overheads are considerably reduced.Esta tesis deriva su motivación en la inherente ineficiencia computacional de los procesadores actuales: a pesar de que muchas aplicaciones contemporáneas tienen unos requisitos de ancho de bits estrechos (aplicaciones de enteros, de red y multimedia), el hardware acaba utilizando el camino de datos completo, utilizando más recursos de los necesarios y consumiendo más energía. Esta tesis utiliza una aproximación HW/SW para atacar, de forma íntegra, el problema de la ineficiencia computacional. El hardware se ha rediseñado para restringir el ancho de bits del camino de datos a sólo 16 bits (únicamente el de enteros) y ofrecer así un núcleo de ejecución simple, de bajo consumo y baja complejidad, el cual está diseñado para ejecutar de forma eficiente el caso común. El rediseño, llamado en esta tesis Arquitectura de Ancho de Bits Estrecho (narrow bitwidth en inglés), es único en el sentido que aunque el camino de datos se ha estrechado a 16 bits, el sistema continúa ofreciendo las ventajas de direccionar grandes cantidades de memoria tal como procesadores con caminos de datos más anchos (64 bits actualmente). Su interface con el mundo exterior se denomina ISA estrecho. En nuestra propuesta el software es responsable de mapear eficientemente la actual pila software de las aplicaciones de 64 bits en el hardware de 16 bits. Sin embargo, esta aproximación HW/SW introduce penalizaciones no despreciables tanto en el tamaño del código dinámico como en el rendimiento, incluso con un traductor de código inteligente que mapea las aplicaciones de 64 bits en el procesador de 16 bits. El objetivo de esta tesis es el de diseñar una capa software que aproveche la capacidad de las optimizaciones para reducir el efecto negativo en el rendimiento del ISA estrecho. Concretamente, esta tesis se centra en optimizaciones que tratan el problema de como compilar programas de 64 bits para una máquina de 16 bits desde la perspectiva de las Mínimas Computaciones Requeridas (MRC en inglés). Dado un programa, la noción de MRC intenta deducir la cantidad de cómputo que realmente se necesita para generar la misma (correcta) salida que el programa original. Aproximarse al MRC perfecto es una meta intrínsecamente ambiciosa y que requiere predicciones perfectas de comportamiento del programa. Con este fin, la tesis propone tres heurísticas basadas en optimizaciones que tratan de inferir el MRC. La utilización de MRC se desarrolla en la definición de productividad: si un cálculo no altera el dato que ya había almacenado, entonces no es productivo y por lo tanto, no es necesario llevarlo a cabo. Se han propuesto tres optimizaciones del código basadas en profile: 1. Propagación Global de la Productividad (GPP en inglés) aplica el concepto de productividad a la granularidad de función. 2. Poda Local de Productividad (LPP en inglés) aplica el mismo concepto pero a una granularidad mucho más fina, la de una única instrucción. 3. Computación Mínima del Salto (MBC en inglés) es una técnica de reordenación de código que aplica los principios de MRC a los saltos condicionales. El objetivo principal de todas esta técnicas es el de reducir el tamaño dinámico del código estrecho. Las primeras dos optimizaciones (GPP y LPP) realizan la tarea de podar especulativamente las computaciones no productivas (innecesarias) utilizando profiles. Además, estas dos optimizaciones realizan un recorrido hacia atrás de las regiones a optimizar para añadir chequeos en el código no especulativo, haciendo de esta forma la técnica autosuficiente para detectar, dinámicamente, los casos de fallo en la especulación. La idea de la optimización MBC es reordenar las instrucciones que generan el salto condicional tal que las mínimas computaciones que general la misma (correcta) salida se ejecuten en la mayoría de los casos; el resto de las computaciones se ejecutarán sólo cuando sea necesario

    Control-Flow Security.

    Full text link
    Computer security is a topic of paramount importance in computing today. Though enormous effort has been expended to reduce the software attack surface, vulnerabilities remain. In contemporary attacks, subverting the control-flow of an application is often the cornerstone to a successful attempt to compromise a system. This subversion, known as a control-flow attack, remains as an essential building block of many software exploits. This dissertation proposes a multi-pronged approach to securing software control-flow to harden the software attack surface. The primary domain of this dissertation is the elimination of the basic mechanism in software enabling control-flow attacks. I address the prevalence of such attacks by going to the heart of the problem, removing all of the operations that inject runtime data into program control. This novel approach, Control-Data Isolation, provides protection by subtracting the root of the problem; indirect control-flow. Previous works have attempted to address control-flow attacks by layering additional complexity in an effort to shield software from attack. In this work, I take a subtractive approach; subtracting the primary cause of both contemporary and classic control-flow attacks. This novel approach to security advances the state of the art in control-flow security by ensuring the integrity of the programmer-intended control-flow graph of an application at runtime. Further, this dissertation provides methodologies to eliminate the barriers to adoption of control-data isolation while simultaneously moving ahead to reduce future attacks. The secondary domain of this dissertation is technique which leverages the process by which software is engineered, tested, and executed to pinpoint the statements in software which are most likely to be exploited by an attacker, defined as the Dynamic Control Frontier. Rather than reacting to successful attacks by patching software, the approach in this dissertation will move ahead of the attacker and identify the susceptible code regions before they are compromised. In total, this dissertation combines software and hardware design techniques to eliminate contemporary control-flow attacks. Further, it demonstrates the efficacy and viability of a subtractive approach to software security, eliminating the elements underlying security vulnerabilities.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/133304/1/warthur_1.pd

    Last-touch correlated data streaming

    Get PDF
    Recent research advocates address-correlating predictors to identify cache block addresses for prefetch. Unfortunately, address-correlating predictors require correlation data storage proportional in size to a program's active memory footprint. As a result, current proposals for this class of predictor are either limited in coverage due to constrained on-chip storage requirements or limited in prediction lookaheaddue to long off-chip correlation data lookup. In this paper, we propose Last-Touch Correlated Data Streaming (LT-cords), a practical address-correlating predictor. The key idea of LT-cords is to record correlation data off chip in the order they will be used and stream them into a practicallysized on-chip table shortly before they are needed, thereby obviating the need for scalable on-chip tables and enabling low-latency lookup. We use cycle-accurate simulation of an 8-way out-of-order superscalar processor to show that: (1) LT-cords with 214KB of on-chip storage can achieve the same coverage as a last-touch predictor with unlimited storage, without sacrificing predictor lookahead, and (2) LT-cords improves performance by 60% on average and 385% at best in the benchmarks studied. © 2007 IEEE

    Prefetching techniques for client server object-oriented database systems

    Get PDF
    The performance of many object-oriented database applications suffers from the page fetch latency which is determined by the expense of disk access. In this work we suggest several prefetching techniques to avoid, or at least to reduce, page fetch latency. In practice no prediction technique is perfect and no prefetching technique can entirely eliminate delay due to page fetch latency. Therefore we are interested in the trade-off between the level of accuracy required for obtaining good results in terms of elapsed time reduction and the processing overhead needed to achieve this level of accuracy. If prefetching accuracy is high then the total elapsed time of an application can be reduced significantly otherwise if the prefetching accuracy is low, many incorrect pages are prefetched and the extra load on the client, network, server and disks decreases the whole system performance. Access pattern of object-oriented databases are often complex and usually hard to predict accurately. The ..
    corecore