6 research outputs found

    Code optimizations for narrow bitwidth architectures

    This thesis takes a HW/SW collaborative approach to tackle the problem of computational inefficiency in a holistic manner. The hardware is redesigned by restricting the datapath to a width of only 16 bits (integer datapath only) to provide an extremely simple, low-cost, low-complexity execution core that executes the most common case efficiently. This redesign, referred to as the Narrow Bitwidth Architecture, is unique in that although the datapath is squeezed to 16 bits, it continues to offer the higher memory addressability of contemporary wider-datapath architectures. Its interface to the outside (software) world is termed the Narrow ISA. The software is responsible for efficiently mapping the current stack of 64-bit applications onto the 16-bit hardware. However, this HW/SW approach introduces a non-negligible penalty in both dynamic code size and performance, even with a reasonably smart code translator that maps the 64-bit applications onto the 16-bit processor. The goal of this thesis is to design a software layer that harnesses the power of compiler optimizations to reduce this performance penalty of the Narrow ISA. More specifically, this thesis focuses on compiler optimizations targeting the problem of how to compile a 64-bit program to a 16-bit datapath machine from the perspective of Minimum Required Computations (MRC). Given a program, the notion of MRC aims to infer how much computation is really required to generate the same (correct) output as the original program. Approaching perfect MRC is an intrinsically ambitious goal and it requires oracle predictions of program behavior. Towards this end, the thesis proposes three heuristic-based optimizations to closely infer the MRC. The perspective of MRC unfolds into a definition of productiveness: if a computation does not alter its storage location, it is non-productive and hence need not be performed.
In this research, the definition of productiveness has been applied to different granularities of the data flow as well as the control flow of programs. Three profile-based code optimization techniques have been proposed: 1. Global Productiveness Propagation (GPP), which applies the concept of productiveness at the granularity of a function. 2. Local Productiveness Pruning (LPP), which applies the same concept at the much finer granularity of a single instruction. 3. Minimal Branch Computation (MBC), a profile-based code-reordering optimization technique which applies the principles of MRC to conditional branches. The primary aim of all these techniques is to reduce the dynamic code footprint of the Narrow ISA. The first two optimizations (GPP and LPP) speculatively prune the non-productive (useless) computations using profiles. Further, these two optimization techniques perform a backward traversal of the optimization regions to embed checks into the non-speculative slices, making them self-sufficient to detect mis-speculation dynamically. The MBC optimization is a use case of the broader concept of a lazy computation model. The idea behind MBC is to reorder the backslices containing narrow computations such that the minimal computations necessary to generate the same (correct) output are performed in the most frequent case; the rest of the computations are performed only when necessary.
With the proposed optimizations, it can be concluded that there do exist ways to smartly compile a 64-bit application to a 16-bit ISA such that the overheads are considerably reduced. The thesis derives its motivation from the inherent computational inefficiency of current processors: although many contemporary applications (integer, network and multimedia applications) have narrow bitwidth requirements, the hardware ends up using the full datapath, using more resources than necessary and consuming more energy.
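The definition of productiveness above (a computation that does not alter its storage location is non-productive) can be illustrated with a small profiling sketch. This is not the thesis's implementation: the trace format, the `site` labels and the function name `profile_productiveness` are hypothetical, and a first write to an address with unknown contents is assumed to be productive.

```python
from collections import Counter


def profile_productiveness(stores, initial_memory=None):
    """Count productive and non-productive stores per static site.

    `stores` is an iterable of (site, address, value) tuples, a
    hypothetical trace format used only for this illustration.
    A store is non-productive when it writes the value that the
    location already holds.
    """
    memory = dict(initial_memory or {})
    productive = Counter()
    non_productive = Counter()
    for site, address, value in stores:
        if address in memory and memory[address] == value:
            # The write leaves the location unchanged.
            non_productive[site] += 1
        else:
            # Assumption: a write to an unseen address counts as productive.
            productive[site] += 1
            memory[address] = value
    return productive, non_productive


# Hypothetical trace: two static store sites writing to two addresses.
trace = [
    ("s1", 0x10, 5),
    ("s1", 0x10, 5),
    ("s2", 0x14, 7),
    ("s1", 0x10, 6),
    ("s2", 0x14, 7),
]
productive, non_productive = profile_productiveness(trace)
```

A profile of this kind only suggests which computations are candidates for speculative pruning; as the abstract notes, run-time checks are still needed to detect mis-speculation.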

    Profile based optimization techniques for large scale applications

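    System: new; Ministry of Education report number: Otsu No. 2095; Degree type: Doctor of Engineering; Date conferred: 2007/3/24; Waseda University degree certificate number: Shin No. 454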

    Improving Performance by Branch Reordering

    No full text
    The conditional branch has long been considered an expensive operation. The relative cost of conditional branches has increased as recently designed machines are now relying on deeper pipelines and higher multiple issue. Reducing the number of conditional branches executed can often result in a substantial performance benefit. This paper describes a code-improving transformation to reorder sequences of conditional branches. First, sequences of branches that can be reordered are detected in the control flow. Second, profiling information is collected to predict the probability that each branch will transfer control out of the sequence. Third, the cost of performing each conditional branch is estimated. Fourth, the most beneficial ordering of the branches based on the estimated probability and cost is selected. The most beneficial ordering often included the insertion of additional conditional branches that did not previously exist in the sequence. Finally, the control flow is restructured to reflect the new ordering. The results of applying the transformation were significant reductions in the dynamic number of instructions and branches, as well as decreases in execution time.
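The selection step described above (choosing an ordering from estimated exit probabilities and costs) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes the exit conditions are mutually exclusive and every cost is positive, in which case sorting by probability-to-cost ratio minimizes the expected cost. The tuple layout and the names `expected_cost` and `reorder` are hypothetical.

```python
def expected_cost(order):
    """Expected cost of testing branches in the given order.

    Each branch is (name, exit_probability, cost). Exit conditions are
    assumed mutually exclusive, so the chance of reaching a branch is
    one minus the exit probabilities of the branches before it.
    """
    remaining = 1.0
    total = 0.0
    for _name, prob, cost in order:
        total += remaining * cost
        remaining -= prob
    return total


def reorder(branches):
    """Order branches by exit probability per unit cost, highest first.

    Assumes every cost is strictly positive.
    """
    return sorted(branches, key=lambda b: b[1] / b[2], reverse=True)


# Hypothetical profile data: (name, exit probability, estimated cost).
branches = [("a", 0.1, 1.0), ("b", 0.6, 2.0), ("c", 0.3, 0.5)]
best = reorder(branches)
print([name for name, _, _ in best])
print(expected_cost(branches), expected_cost(best))
```

The sketch does not cover inserting additional branches or sequences of range conditions on a common variable, both of which the abstract mentions.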

    Improving Performance By Branch Reordering

    No full text
    Contents: 1 Introduction; 2 Related Work; 3 Detecting a Sequence of Reorderable Branches (3.1 Detecting a Sequence of Reorderable Branches with a Common Successor; 3.2 Detecting a Sequence of Reorderable Range Conditions Comparing a Common Variable to Constants); 4 Handling Side Effects in a Common Variable Sequence; 5 Performing Profiling (5.1 Producing Profile Information for Common Successor Sequence; 5.2 Producing Profile Information for Common Variable Sequence); 6 Selecting the Ordering of Branches (6.1 Selecting the Order of a Sequence of Branches with Common Successors; 6.2 Selecting the Order of a Sequence of Range Conditions Comparing a Common Variable); 7 Improving the Selected Sequence of Range Conditions; 8 Applying the Reordering Transformation; 9 Res..

    Improving performance by branch reordering

    No full text
    The conditional branch has long been considered an expensive operation. The relative cost of conditional branches has increased as recently designed machines are now relying on deeper pipelines and higher multiple issue. Reducing the number of conditional branches executed can often result in a substantial performance benefit. This paper describes a code-improving transformation to reorder sequences of conditional branches. First, sequences of branches that can be reordered are detected in the control flow. Second, profiling information is collected to predict the probability that each branch will transfer control out of the sequence. Third, the cost of performing each conditional branch is estimated. Fourth, the most beneficial ordering of the branches based on the estimated probability and cost is selected. The most beneficial ordering often included the insertion of additional conditional branches that did not previously exist in the sequence. Finally, the control flow is restructured to reflect the new ordering. The results of applying the transformation were significant reductions in the dynamic number of instructions and branches, as well as decreases in execution time.

    Improving performance by branch reordering

    No full text