1,828 research outputs found

    Predicated execution and register windows for out-of-order processors

    ISA extensions are a powerful way to implement new hardware techniques that require or benefit from compiler support: decisions made at compile time can be complemented at runtime, achieving a synergistic effect between the compiler and the processor. This thesis focuses on two ISA extensions: predicated execution and register windows. Predicated execution is exploited by the if-conversion compiler technique. If-conversion removes control dependences by transforming them into data dependences, which helps to exploit ILP beyond a single basic block. Register windows reduce the number of loads and stores required to save and restore registers across procedure calls by storing multiple contexts in a large architectural register file. In-order processors especially benefit from both ISA extensions, which help overcome the limitations that control dependences and the memory hierarchy impose on static scheduling. Predicated execution allows control-dependent instructions to be moved past branches. Register windows reduce the number of memory operations across procedure calls. Although if-conversion and register windows were not developed exclusively for in-order processors, their use in out-of-order processors has been studied very little. In this thesis we show that if-conversion and register windows introduce new performance opportunities and new challenges in out-of-order processors. The use of if-conversion in out-of-order processors helps to eliminate hard-to-predict branches, alleviating the severe performance penalties caused by branch mispredictions. However, the removal of some conditional branches by if-conversion may adversely affect the predictability of the remaining branches, because it may reduce the amount of correlation information available to the branch predictor. Moreover, predicated execution in out-of-order processors has to deal with two performance issues. First, multiple definitions of the same logical register can be merged into a single control flow, where each definition is guarded by a different predicate. Second, instructions whose guarding predicate evaluates to false consume resources unnecessarily. This thesis proposes a branch prediction scheme based on predicate prediction that solves the three problems mentioned above. This scheme, which is built on top of a predicated ISA that implements a compare-and-branch model such as the one considered in this thesis, has two advantages. First, branch prediction accuracy improves because the correlation information is not lost after if-conversion, and the proposed mechanism can use the computed value of the branch predicate when it is available, achieving 100% accuracy. Second, it avoids the problems of out-of-order predicated execution. Regarding register windows, we propose a mechanism that reduces the physical register requirements of an out-of-order processor to the bare minimum with almost no performance loss. The mechanism is based on identifying which architectural registers are in use by current in-flight instructions. Registers that are not in use, i.e., those that no in-flight instruction references, can be released early. In this thesis we propose a very efficient and low-cost hardware implementation of predicated execution and register windows that provides important benefits to out-of-order processors.
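
The early-release criterion stated in the abstract (a register is a candidate for release when no in-flight instruction references it) can be illustrated with a minimal Python sketch. The `Instr` record, the register names, and the `releasable` helper are hypothetical and chosen only for illustration; the abstract does not describe the actual hardware structures, nor where the value of a released register is kept, so this shows the set computation only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical in-flight instruction: one optional destination, some sources."""
    dst: Optional[str]
    srcs: Tuple[str, ...]


def releasable(arch_regs: Iterable[str], in_flight: Iterable[Instr]) -> Set[str]:
    """Return the architectural registers that no in-flight instruction references."""
    in_use: Set[str] = set()
    for ins in in_flight:
        in_use.update(ins.srcs)      # registers read by in-flight instructions
        if ins.dst is not None:
            in_use.add(ins.dst)      # registers written by in-flight instructions
    return set(arch_regs) - in_use


if __name__ == "__main__":
    regs = [f"r{i}" for i in range(8)]
    window = [Instr("r1", ("r2", "r3")), Instr(None, ("r1",))]
    print(sorted(releasable(regs, window)))
```

In this toy model the registers r1, r2 and r3 stay allocated because the two in-flight instructions reference them, while the remaining registers are reported as release candidates.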

    Aggressive Memory Speculation in HW/SW Co-Designed Machines

    Single-ISA heterogeneous systems (such as ARM big.LITTLE) are an attractive solution for embedded platforms because they expose performance/energy trade-offs directly to the operating system. Recent work has demonstrated that their efficiency can be increased by using VLIW cores, supported through Dynamic Binary Translation (DBT) to maintain the illusion of a single-ISA system. However, VLIW cores cannot rival Out-of-Order (OoO) cores in performance, mainly because they do not use speculative execution. In this work, we study how memory dependency speculation can be used during the DBT process. Our approach enables fine-grained speculation optimizations thanks to a combination of hardware and software. Our results show that our approach leads to a geo-mean speed-up of 10% at the price of a 7% area overhead.
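
As a rough illustration of memory dependency speculation in general, the Python sketch below hoists a load above an earlier store, then checks the addresses and replays the load if they turn out to alias. It is a generic toy model, not the mechanism of the work above; the function names and the dictionary-based memory are assumptions made for the example.

```python
from typing import Dict, Tuple


def run_in_order(mem: Dict[int, int], store_addr: int, store_val: int,
                 load_addr: int) -> int:
    """Reference semantics: the store executes first, then the load."""
    mem = dict(mem)
    mem[store_addr] = store_val
    return mem.get(load_addr, 0)


def run_speculative(mem: Dict[int, int], store_addr: int, store_val: int,
                    load_addr: int) -> Tuple[int, bool]:
    """Execute the load before the store, then verify the speculation.

    Returns the loaded value and a flag telling whether recovery was needed.
    """
    mem = dict(mem)
    loaded = mem.get(load_addr, 0)   # load hoisted above the store
    mem[store_addr] = store_val      # the store executes afterwards
    if store_addr == load_addr:      # addresses alias: speculation failed
        loaded = mem[load_addr]      # recovery: replay the load
        return loaded, True
    return loaded, False


if __name__ == "__main__":
    memory = {0: 5, 1: 7}
    print(run_speculative(memory, 1, 9, 0))  # no alias, speculation holds
    print(run_speculative(memory, 0, 9, 0))  # alias, load is replayed
```

In both cases the speculative version returns the same value as `run_in_order`; the flag only records whether the check forced a replay.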

    IR-Level Versus Machine-Level If-Conversion for Predicated Architectures

    If-conversion is a simple yet powerful optimization that converts control dependences into data dependences. It allows elimination of branches and increases available instruction-level parallelism and thus overall performance. If-conversion can either be applied alone or in combination with other techniques that increase the size of scheduling regions. The presence of hardware support for predicated execution allows if-conversion to be broadly applied in a given program. This makes it necessary to guide the optimization using heuristic estimates regarding its potential benefit. Similar to other transformations in an optimizing compiler, if-conversion inherently suffers from phase ordering issues. Driven by these facts, we developed two algorithms for if-conversion targeting the TI TMS320C64x+ architecture within the LLVM framework. Each implementation targets a different level of code abstraction. While one targets the intermediate representation, the other addresses machine-level code. Both make use of an adapted set of estimation heuristics and prove to be successful in general, but each one exhibits different strengths and weaknesses. High-level if-conversion, applied before other control flow transformations, has more freedom to operate. But in contrast to its machine-level counterpart, which is more restricted, its estimations of runtime are less accurate. Our results from experimental evaluation show a mean speedup close to 14% for both algorithms on a set of programs from the MiBench and DSPstone benchmark suites. We give a comparison of the implemented optimizations and discuss gained insights on the topics of if-conversion, phase ordering issues and profitability analysis.
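
The core idea of if-conversion, turning a control dependence into a data dependence, can be shown with a small Python sketch. The functions `branchy` and `if_converted` are hypothetical names; the arithmetic select at the end stands in for predicated instructions and is not the code either of the two LLVM implementations would emit.

```python
def branchy(a: int, b: int, c: int) -> int:
    """Original form: which assignment runs depends on a branch."""
    if a > b:
        x = c + 1
    else:
        x = c - 1
    return x


def if_converted(a: int, b: int, c: int) -> int:
    """If-converted form: both sides are computed, a predicate selects the result."""
    p = int(a > b)          # predicate definition (the former branch condition)
    x_then = c + 1          # would be guarded by p on a predicated machine
    x_else = c - 1          # would be guarded by not p
    return p * x_then + (1 - p) * x_else   # branch-free select on the predicate


if __name__ == "__main__":
    for args in [(3, 1, 10), (1, 3, 10), (2, 2, 0)]:
        print(args, branchy(*args), if_converted(*args))
```

The converted version always executes both arms, which is why the abstract stresses heuristics: the transformation pays off only when the cost of the extra work is lower than the cost of the removed branch.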

    Optimal Global Instruction Scheduling for the Itanium® Processor Architecture

    On the Itanium 2 processor, effective global instruction scheduling is crucial to high performance. At the same time, it poses a challenge to the compiler: this code generation subtask involves strongly interdependent decisions and complex trade-offs that are difficult for heuristics to cope with. We tackle this NP-complete problem with integer linear programming (ILP), a search-based method that yields provably optimal results. This promises faster code as well as insights into the potential of the architecture. Our ILP model comprises global code motion with compensation copies, predication, and Itanium-specific features like control/data speculation. In integer linear programming, well-structured models are the key to acceptable solution times. The feasible solutions of an ILP are represented by integer points inside a polytope. If all vertices of this polytope are integral, then the ILP can be solved in polynomial time. We define two subproblems of global scheduling in which some constraint classes are omitted and show that the corresponding two subpolytopes of our ILP model are integral and polynomial sized. This substantiates that the model found is highly efficient, which is also confirmed by the reasonable solution times. The ILP formulation is extended by further transformations like cyclic code motion, which moves instructions upwards out of a loop, circularly in the opposite direction of the loop backedges. Since the architecture requires instructions to be encoded in fixed-size bundles of three, a bundler is developed that computes bundle sequences of minimal size by means of precomputed results and dynamic programming. Experiments have been conducted with a postpass tool that implements the ILP scheduler. It parses assembly procedures generated by Intel's Itanium compiler and reschedules them as a whole. Using this tool, we optimize a selection of hot functions from the SPECint 2000 benchmark. The results show a significant speedup over the original code.

    Efficient Predicated Execution Techniques for Reconfigurable Architectures (재구성형 구조에서의 효율적인 조건실행 기법)

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2013. Kiyoung Choi (최기영).
    Coarse-Grained Reconfigurable Architecture (CGRA) is one of the viable solutions for accelerating data-intensive applications in embedded systems. It typically consists of an array of processing elements (PEs) and a centralized controller, which together can provide high performance, flexibility, and low power. Parallel array processing reduces the execution time of applications, reconfigurability of the PEs allows their functionality to be changed, and a simplified control structure with static scheduling of instruction fetching and data communication minimizes power consumption. However, as applications become more complex and their data-intensive parts contain control flow, the effectiveness of CGRAs is challenged. Since all PEs are controlled by a centralized unit, it is impossible to execute programs with control divergence among PEs. To overcome the problem, we can adopt the technique called predicated execution, which is the only solution known so far, but conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction-fetching/decoding/nullifying steps. Thus, this thesis reveals performance and power issues in predicated execution when a CGRA executes both data- and control-intensive applications, which have not been well addressed yet. Then it proposes high-performance and low-power predication mechanisms. Experiments conducted through gate-level simulation show that the proposed mechanism improves energy-delay product by 11.9%, 14.7%, and 23.8% compared to three conventional techniques. In addition, this thesis reveals mapping issues that arise when applications are mapped onto CGRAs using the proposed predication. A power-saving mode introduced into the PEs prevents multiple conditionals from being parallelized if conventional mapping algorithms are used. Thus, this thesis proposes a framework that resolves this problem by mapping conditionals to different PEs. Experiments show that mapping results from the proposed approach lead to 2.21 times higher performance than those of the naïve approach.
    Contents:
    Abstract
    Chapter 1 Introduction
    Chapter 2 Background and Related Work
        2.1 Coarse-Grained Reconfigurable Architecture
            2.1.1 Introduction
            2.1.2 Target Domain
            2.1.3 Comparison with Other Architectures
            2.1.4 Application Mapping
            2.1.5 Target CGRA
        2.2 Predicated Execution Technique
            2.2.1 Introduction
            2.2.2 Classification
            2.2.3 Different Roles in ILP and DLP processors
            2.2.4 Predication Support on CGRAs
    Chapter 3 Conventional Predicated Execution Techniques
        3.1 Partial Predication (Partial)
        3.2 Condition-Based Full Predication (CondFull)
    Chapter 4 State-Based Full Predication
        4.1 Previous Approach (PseudoBranch)
        4.2 Counter-Based Approach (StateFull)
        4.3 Dual-Issue-Single-Execution (DISE)
        4.4 Hybrid Predication
            4.4.1 Motivation
            4.4.2 StateFull+Partial
            4.4.3 StateFull+Partial+DISE
    Chapter 5 Evaluation
        5.1 Implementation
            5.1.1 Conventional Techniques
            5.1.2 Proposed Techniques
        5.2 Experimental Setup
        5.3 Experimental Results
            5.3.1 Effect of Predication Mechanism on Power Consumption of a PE
            5.3.2 Quantitative Definitions of short-if and long-if
            5.3.3 Compilation Strategy in StateFull+Partial
            5.3.4 Conventional Techniques (Partial, CondFull, and PseudoBranch) vs. Proposed StateFull Technique
            5.3.5 Proposed Hybrid Predication Techniques
            5.3.6 Putting Together
            5.3.7 Speedup of Applications
    Chapter 6 Mapping Framework
        6.1 Motivation
        6.2 Proposed Approach
            6.2.1 Overall Flow
            6.2.2 From IR to CDFG
            6.2.3 Separation
            6.2.4 CDFG Mapping
        6.3 Implementation
        6.4 Experiments
            6.4.1 Experimental Setup
            6.4.2 Verification of Mapping Framework
            6.4.3 Quality of Mapping Results
    Chapter 7 Conclusion
        7.1 Summary
        7.2 Applicable Scope and Future Work
    Appendix
    Abstract in Korean
    Acknowledgments

    An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration).