19 research outputs found

    Survey on Combinatorial Register Allocation and Instruction Scheduling

    Full text link
    Register allocation (mapping variables to processor registers or memory) and instruction scheduling (reordering instructions to increase instruction-level parallelism) are essential tasks for generating efficient assembly code in a compiler. In the last three decades, combinatorial optimization has emerged as an alternative to traditional, heuristic algorithms for these two tasks. Combinatorial optimization approaches can deliver optimal solutions according to a model, can precisely capture trade-offs between conflicting decisions, and are more flexible at the expense of increased compilation time. This paper provides an exhaustive literature review and a classification of combinatorial optimization approaches to register allocation and instruction scheduling, with a focus on the techniques that are most applied in this context: integer programming, constraint programming, partitioned Boolean quadratic programming, and enumeration. Researchers in compilers and combinatorial optimization can benefit from identifying developments, trends, and challenges in the area; compiler practitioners may discern opportunities and grasp the potential benefit of applying combinatorial optimization

    Compilation techniques for short-vector instructions

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (p. 127-133).Multimedia extensions are nearly ubiquitous in today's general-purpose processors. These extensions consist primarily of a set of short-vector instructions that apply the same opcode to a vector of operands. This design introduces a data-parallel component to processors that exploit instruction-level parallelism, and presents an opportunity for increased performance. In fact, ignoring a processor's vector opcodes can leave a significant portion of the available resources unused. In order for software developers to find short-vector instructions generally useful, the compiler must target these extensions with complete transparency and consistent performance. This thesis develops compiler techniques to target short-vector instructions automatically and efficiently. One important aspect of compilation is the effective management of memory alignment. As with scalar loads and stores, vector references are typically more efficient when accessing aligned regions. In many cases, the compiler can glean no alignment information and must emit conservative code sequences. In response, I introduce a range of compiler techniques for detecting and enforcing aligned references. In my benchmark suite, the most practical method ensures alignment for roughly 75% of dynamic memory references.(cont.) This thesis also introduces selective vectorization, a technique for balancing computation across a processor's scalar and vector resources. Current approaches for targeting short-vector instructions directly adopt vectorizing technology first developed for supercomputers. Traditional vectorization, however, can lead to a performance degradation since it fails to account for a processor's scalar execution resources. I formulate selective vectorization in the context of software pipelining. My approach creates software pipelines with shorter initiation intervals, and therefore, higher performance. In contrast to conventional methods, selective vectorization operates on a low-level intermediate representation. This technique allows the algorithm to accurately measure the performance trade-offs of code selection alternatives. A key aspect of selective vectorization is its ability to manage communication of operands between vector and scalar instructions. Even when operand transfer is expensive, the technique is sufficiently sophisticated to achieve significant performance gains. I evaluate selective vectorization on a set of SPEC FP benchmarks. On a realistic VLIW processor model, the approach achieves whole-program speedups of up to 1.35x over existing approaches. For individual loops, it provides speedups of up to 1.75x.by Samuel Larsen.Ph.D

    Optimal Global Instruction Scheduling for the Itanium® Processor Architecture

    Get PDF
    On the Itanium 2 processor, effective global instruction scheduling is crucial to high performance. At the same time, it poses a challenge to the compiler: This code generation subtask involves strongly interdependent decisions and complex trade-offs that are difficult to cope with for heuristics. We tackle this NP-complete problem with integer linear programming (ILP), a search-based method that yields provably optimal results. This promises faster code as well as insights into the potential of the architecture. Our ILP model comprises global code motion with compensation copies, predication, and Itanium-specific features like control/data speculation. In integer linear programming, well-structured models are the key to acceptable solution times. The feasible solutions of an ILP are represented by integer points inside a polytope. If all vertices of this polytope are integral, then the ILP can be solved in polynomial time. We define two subproblems of global scheduling in which some constraint classes are omitted and show that the corresponding two subpolytopes of our ILP model are integral and polynomial sized. This substantiates that the found model is of high efficiency, which is also confirmed by the reasonable solution times. The ILP formulation is extended by further transformations like cyclic code motion, which moves instructions upwards out of a loop, circularly in the opposite direction of the loop backedges. Since the architecture requires instructions to be encoded in fixed-sized bundles of three, a bundler is developed that computes bundle sequences of minimal size by means of precomputed results and dynamic programming. Experiments have been conducted with a postpass tool that implements the ILP scheduler. It parses assembly procedures generated by Intel�s Itanium compiler and reschedules them as a whole. Using this tool, we optimize a selection of hot functions from the SPECint 2000 benchmark. The results show a significant speedup over the original code.Globale Instruktionsanordnung hat beim Itanium-2-Prozessor großen Einfluß auf die Leistung und stellt dabei gleichzeitig eine Herausforderung für den Compiler dar: Sie ist mit zahlreichen komplexen, wechselseitig voneinander abhängigen Entscheidungen verbunden, die für Heuristiken nur schwer zu beherrschen sind.Wir lösen diesesNP-vollständige Problem mit ganzzahliger linearer Programmierung (ILP), einer suchbasierten Methode mit beweisbar optimalen Ergebnissen. Das ermöglicht neben schnellerem Code auch Einblicke in das Potential der Itanium- Prozessorarchitektur. Unser ILP-Modell umfaßt globale Codeverschiebungen mit Kompensationscode, Prädikation und Itanium-spezifische Techniken wie Kontroll- und Datenspekulation. Bei ganzzahliger linearer Programmierung sind wohlstrukturierte Modelle der Schlüssel zu akzeptablen Lösungszeiten. Die zulässigen Lösungen eines ILPs werden durch ganzzahlige Punkte innerhalb eines Polytops repräsentiert. Sind die Eckpunkte dieses Polytops ganzzahlig, kann das ILP in Polynomialzeit gelöst werden. Wir definieren zwei Teilprobleme globaler Instruktionsanordnung durch Auslassung bestimmter Klassen von Nebenbedingungen und beweisen, daß die korrespondierenden Teilpolytope unseres ILP-Modells ganzzahlig und von polynomieller Größe sind. Dies untermauert die hohe Effizienz des gefundenen Modells, die auch durch moderate Lösungszeiten bestätigt wird. Das ILP-Modell wird um weitere Transformationen wie zyklische Codeverschiebung erweitert; letztere bezeichnet das Verschieben von Befehlen aufwärts aus einer Schleife heraus, in Gegenrichtung ihrer Rückwärtskanten. Da die Architektur eine Kodierung der Befehle in Dreierbündeln fester Größe vorschreibt, wird ein Bundler entwickelt, der Bündelsequenzen minimaler Länge mit Hilfe vorberechneter Teilergebnisse und dynamischer Programmierung erzeugt. Für die Experimente wurde ein Postpassoptimierer erstellt. Er liest von Intels Itanium-Compiler erzeugte Assemblerroutinen ein und ordnet die enthaltenen Instruktionen mit Hilfe der ILP-Methode neu an. Angewandt auf eine Auswahl von Funktionen aus dem Benchmark SPECint 2000 erreicht der Optimierer eine signifikante Beschleunigung gegenüber dem Originalcode

    Optimal Global Instruction Scheduling for the Itanium® Processor Architecture

    Get PDF
    On the Itanium 2 processor, effective global instruction scheduling is crucial to high performance. At the same time, it poses a challenge to the compiler: This code generation subtask involves strongly interdependent decisions and complex trade-offs that are difficult to cope with for heuristics. We tackle this NP-complete problem with integer linear programming (ILP), a search-based method that yields provably optimal results. This promises faster code as well as insights into the potential of the architecture. Our ILP model comprises global code motion with compensation copies, predication, and Itanium-specific features like control/data speculation. In integer linear programming, well-structured models are the key to acceptable solution times. The feasible solutions of an ILP are represented by integer points inside a polytope. If all vertices of this polytope are integral, then the ILP can be solved in polynomial time. We define two subproblems of global scheduling in which some constraint classes are omitted and show that the corresponding two subpolytopes of our ILP model are integral and polynomial sized. This substantiates that the found model is of high efficiency, which is also confirmed by the reasonable solution times. The ILP formulation is extended by further transformations like cyclic code motion, which moves instructions upwards out of a loop, circularly in the opposite direction of the loop backedges. Since the architecture requires instructions to be encoded in fixed-sized bundles of three, a bundler is developed that computes bundle sequences of minimal size by means of precomputed results and dynamic programming. Experiments have been conducted with a postpass tool that implements the ILP scheduler. It parses assembly procedures generated by Intel�s Itanium compiler and reschedules them as a whole. Using this tool, we optimize a selection of hot functions from the SPECint 2000 benchmark. The results show a significant speedup over the original code.Globale Instruktionsanordnung hat beim Itanium-2-Prozessor großen Einfluß auf die Leistung und stellt dabei gleichzeitig eine Herausforderung für den Compiler dar: Sie ist mit zahlreichen komplexen, wechselseitig voneinander abhängigen Entscheidungen verbunden, die für Heuristiken nur schwer zu beherrschen sind.Wir lösen diesesNP-vollständige Problem mit ganzzahliger linearer Programmierung (ILP), einer suchbasierten Methode mit beweisbar optimalen Ergebnissen. Das ermöglicht neben schnellerem Code auch Einblicke in das Potential der Itanium- Prozessorarchitektur. Unser ILP-Modell umfaßt globale Codeverschiebungen mit Kompensationscode, Prädikation und Itanium-spezifische Techniken wie Kontroll- und Datenspekulation. Bei ganzzahliger linearer Programmierung sind wohlstrukturierte Modelle der Schlüssel zu akzeptablen Lösungszeiten. Die zulässigen Lösungen eines ILPs werden durch ganzzahlige Punkte innerhalb eines Polytops repräsentiert. Sind die Eckpunkte dieses Polytops ganzzahlig, kann das ILP in Polynomialzeit gelöst werden. Wir definieren zwei Teilprobleme globaler Instruktionsanordnung durch Auslassung bestimmter Klassen von Nebenbedingungen und beweisen, daß die korrespondierenden Teilpolytope unseres ILP-Modells ganzzahlig und von polynomieller Größe sind. Dies untermauert die hohe Effizienz des gefundenen Modells, die auch durch moderate Lösungszeiten bestätigt wird. Das ILP-Modell wird um weitere Transformationen wie zyklische Codeverschiebung erweitert; letztere bezeichnet das Verschieben von Befehlen aufwärts aus einer Schleife heraus, in Gegenrichtung ihrer Rückwärtskanten. Da die Architektur eine Kodierung der Befehle in Dreierbündeln fester Größe vorschreibt, wird ein Bundler entwickelt, der Bündelsequenzen minimaler Länge mit Hilfe vorberechneter Teilergebnisse und dynamischer Programmierung erzeugt. Für die Experimente wurde ein Postpassoptimierer erstellt. Er liest von Intels Itanium-Compiler erzeugte Assemblerroutinen ein und ordnet die enthaltenen Instruktionen mit Hilfe der ILP-Methode neu an. Angewandt auf eine Auswahl von Funktionen aus dem Benchmark SPECint 2000 erreicht der Optimierer eine signifikante Beschleunigung gegenüber dem Originalcode

    Iterative Compilation and Performance Prediction for Numerical Applications

    Get PDF
    Institute for Computing Systems ArchitectureAs the current rate of improvement in processor performance far exceeds the rate of memory performance, memory latency is the dominant overhead in many performance critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of a rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits from different optimisations and there are no simple criteria to stop optimisations i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring using a new reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of profiling to find the best possible program variant, have been developed. These strategies have been evaluated using a range of kernels and programs on different platforms and operating systems. A significant performance improvement has been achieved using new approaches when compared to the state-of-the-art native static and platform-specific feedback directed compilers

    Tiled microprocessors

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.Includes bibliographical references (p. 251-258).Current-day microprocessors have reached the point of diminishing returns due to inherent scalability limitations. This thesis examines the tiled microprocessor, a class of microprocessor which is physically scalable but inherits many of the desirable properties of conventional microprocessors. Tiled microprocessors are composed of an array of replicated tiles connected by a special class of network, the Scalar Operand Network (SON), which is optimized for low-latency, low-occupancy communication between remote ALUs on different tiles. Tiled microprocessors can be constructed to scale to 100's or 1000's of functional units. This thesis identifies seven key criteria for achieving physical scalability in tiled microprocessors. It employs an archetypal tiled microprocessor to examine the challenges in achieving these criteria and to explore the properties of Scalar Operand Networks. The thesis develops the field of SONs in three major ways: it introduces the 5-tuple performance metric, it describes a complete, high-frequency SON implementation, and it proposes a taxonomy, called AsTrO, for categorizing them.(cont.) To develop these ideas, the thesis details the design, implementation and analysis of a tiled microprocessor prototype, the Raw Microprocessor, which was implemented at MIT in 180 nm technology. Overall, compared to Raw, recent commercial processors with half the transistors required 30x as many lines of code, occupied 100x as many designers, contained 50x as many pre-tapeout bugs, and resulted in 33x as many post-tapeout bugs. At the same time, the Raw microprocessor proves to be more versatile in exploiting ILP, stream, and server-farm workloads with modest to large amounts of parallelism.by Michael Bedford Taylor.Ph.D

    An Ultra-Low-Energy, Variation-Tolerant FPGA Architecture Using Component-Specific Mapping

    Get PDF
    As feature sizes scale toward atomic limits, parameter variation continues to increase, leading to increased margins in both delay and energy. Parameter variation both slows down devices and causes devices to fail. For applications that require high performance, the possibility of very slow devices on critical paths forces designers to reduce clock speed in order to meet timing. For an important and emerging class of applications that target energy-minimal operation at the cost of delay, the impact of variation-induced defects at very low voltages mandates the sizing up of transistors and operation at higher voltages to maintain functionality. With post-fabrication configurability, FPGAs have the opportunity to self-measure the impact of variation, determining the speed and functionality of each individual resource. Given that information, a delay-aware router can use slow devices on non-critical paths, fast devices on critical paths, and avoid known defects. By mapping each component individually and customizing designs to a component's unique physical characteristics, we demonstrate that we can eliminate delay margins and reduce energy margins caused by variation. To quantify the potential benefit we might gain from component-specific mapping, we first measure the margins associated with parameter variation, and then focus primarily on the energy benefits of FPGA delay-aware routing over a wide range of predictive technologies (45 nm--12 nm) for the Toronto20 benchmark set. We show that relative to delay-oblivious routing, delay-aware routing without any significant optimizations can reduce minimum energy/operation by 1.72x at 22 nm. We demonstrate how to construct an FPGA architecture specifically tailored to further increase the minimum energy savings of component-specific mapping by using the following techniques: power gating, gate sizing, interconnect sparing, and LUT remapping. With all optimizations considered we show a minimum energy/operation savings of 2.66x at 22 nm, or 1.68--2.95x when considered across 45--12 nm. As there are many challenges to measuring resource delays and mapping per chip, we discuss methods that may make component-specific mapping more practical. We demonstrate that a simpler, defect-aware routing achieves 70% of the energy savings of delay-aware routing. Finally, we show that without variation tolerance, scaling from 16 nm to 12 nm results in a net increase in minimum energy/operation; component-specific mapping, however, can extend minimum energy/operation scaling to 12 nm and possibly beyond.</p

    Affordable kilo-instruction processors

    Get PDF
    Diversos motius expliquen l'estancament en el que es troba el desenvolupament del processador tradicional dissenyat per maximitzar el rendiment d'un únic fil d'execució. Per una banda, técniques agressives com la supersegmentacó del camí de dades o l'execució fora d'ordre tenen un impacte molt negatiu sobre el consum de potència i la complexitat del disseny. Altrament, l'increment en la freqüència del processador augmenta la discrepància entre la velocitat del processador i el temps d'accés a memòria principal. Tot i que les memòries cau redueixen considerablement el nombre d'accessos a memòria principal, aquests accessos introdueixen latencies prou grans per reduir considerablement el rendiment. Tècniques convencionals com l'execució fora d'ordre, útils per ocultar accessos a les memòries cau de 2on nivell, no estan pensades per ocultar latències tan grans. Caldrien cues amb mides de centenars d'instruccions i milers de registres per tal de no interrompre l'execució en el moment de produir-se un accés a memòria principal. Desafortunadament, la tecnologia disponible no és eficient per implementar aquestes estructures monolíticament, doncs resultaria un temps d'accés molt elevat, un consum de potència igualment elevat i un àrea no menyspreable. En aquesta tesi s'han estudiat tècniques que permeten l'implementació d'un processador amb capacitat per continuar processant instruccions en el cas de que es produeixin accessos a memòria principal. Les condicions per a que aquest processador sigui implementable són que estigui basat en estructures de mida convencional i que tingui una unitat de control senzilla. El repte es troba en conciliar un model de processador distribuït amb un control senzill. El problema del disseny del processador s'ha enfocat observant el comportament d'un processador de recursos infinits. S'ha observat que l'execució segueix uns patrons molt interessants, basats en la localitat d'execució. En aplicacions numèriques s'observa que més del 70% de les instruccions no depenen de accessos a memòria principal. Aixó és molt important doncs mostra que sempre hi ha una porció important d'instruccions executables poc després de la decodificació. Aixó permet proposar un nou tipus de processador amb dues unitats d'execució. La primera unitat (el "Cache Processor") processa a alta velocitat instruccions independents de memòria principal. La segona unitat ("Memory Processor") processa les instruccions dependents de accessos a memòria principal, pero de forma molt més relaxada, cosa que li permet mantenir milers de instruccions en vol. Aquesta proposta rep el nom de Decoupled KILO-Instruction Processor (D-KIP) i té forces avantatges: per un costat permet la construcció d'un kilo-instruction processor basat en estructures convencionals i per l'altre simplifica el disseny ja que minimitza les interaccions entre ambdos unitats d'execució.En aquesta tesi es proposen dos implementacions de processadors desacoblats: el D-KIP original, i el Flexible Heterogeneous MultiCore (FMC). Sobre aquestes propostes s'analitza el rendiment i es compara amb altres tècniques que incrementan el parallelisme de memoria, com el prefetching o l'execució "runahead". D'aquesta avaluació es desprén que el processador FMC té un rendiment similar al de un processador convencional amb una finestra de 1500 instruccions en vol. Posteriorment s'analitza l'integració del FMC en entorns multicore/multiprogrammats. La tesi es completa amb la proposta d'una cua de loads i stores (LSQ) per a aquest tipus de processador.Several motives explain the slowdown of high-performance single-thread processor development. On the one hand, aggressive techniques such as superpipelining or out-of-order execution have a considerable impact on power consumption and design complexity. On the other hand, the increment in processor frequencies has led to a large disparity between processor speed and memory access time. Although cache memories considerably reduce the number of accesses to main memory, the remaining accesses introduce latencies large enough to considerably decrease performance. Conventional techniques such as out-of-order execution, while effective in hiding L2 cache accesses, cannot hide latencies this large. Queues of hundreds of entries and thousands of registers would be necessary in order to prevent execution from stalling in the event of a L2 cache miss. Unfortunately, current technology cannot efficiently implement such structures monolithically, as access latencies would considerably increase, as would power consumption and area consumption.In this thesis we studied techniques that allow the processor to continue processing instructions in the event of main memory accesses. The conditions for such a processor to be implementable are that it should be based on structures of conventional size and that it should feature simple control logic. The challenge lies in being able to design a distributed processor with simple control. The design of this processor has been approached by analyzing the behavior of a processor with infinite resources. We have observed that execution follows a very interesting pattern based on execution locality. In numerical codes we observed that over 70% of all instructions do not depend on memory accesses. This is interesting since it shows that there is always a large portion of instructions that can be executed shortly after decode. This allows us to propose a new kind of processor with two execution units. The first unit, the Cache Processor, processes memory-independent instructions at high speed. The second unit, the Memory Processor, processes instructions that depend on main memory accesses, but using relaxed scheduling logic, which allows it to scale to thousands of in-flight instructions. This proposal, which receives the name of Decoupled KILO-Instruction Processor (D-KIP), has several advantages. On the one hand it allows the construction of a kilo-instruction processor based on conventional structures and, on the other hand, it simplifies the design as the interaction between both execution units is minimal. In this thesis two implementations for this kind of processor are presented: the original D-KIP and the Flexible Heterogeneous MultiCore (FMC). The performance of these proposals is analyzed and compared to other proposals that increase memory-level parallelism, such as prefetching or runahead execution. It is observed that the FMC processor performs at the same level of a conventional processor with a window of around 1500 instructions. Further, the integration of the FMC processor into a multicore/multiprogrammed environment is studied. This thesis concludes with the proposal of a two-level Load/Store Queue for this kind of processor

    Reducing exception management overhead with software restart markers

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.Includes bibliographical references (p. 181-196).Modern processors rely on exception handling mechanisms to detect errors and to implement various features such as virtual memory. However, these mechanisms are typically hardware-intensive because of the need to buffer partially-completed instructions to implement precise exceptions and enforce in-order instruction commit, often leading to issues with performance and energy efficiency. The situation is exacerbated in highly parallel machines with large quantities of programmer-visible state, such as VLIW or vector processors. As architects increasingly rely on parallel architectures to achieve higher performance, the problem of exception handling is becoming critical. In this thesis, I present software restart markers as the foundation of an exception handling mechanism for explicitly parallel architectures. With this model, the compiler is responsible for delimiting regions of idempotent code. If an exception occurs, the operating system will resume execution from the beginning of the region. One advantage of this approach is that instruction results can be committed to architectural state in any order within a region, eliminating the need to buffer those values. Enabling out-of-order commit can substantially reduce the exception management overhead found in precise exception implementations, and enable the use of new architectural features that might be prohibitively costly with conventional precise exception implementations. Additionally, software restart markers can be used to reduce context switch overhead in a multiprogrammed environment. This thesis demonstrates the applicability of software restart markers to vector, VLIW, and multithreaded architectures. It also contains an implementation of this exception handling approach that uses the Trimaran compiler infrastructure to target the Scale vectorthread architecture. I show that using software restart markers incurs very little performance overhead for vector-style execution on Scale.(cont.) Finally, I describe the Scale compiler flow developed as part of this work and discuss how it targets certain features facilitated by the use of software restart markersby Mark Jerome Hampton.Ph.D
    corecore