141 research outputs found

    Predicated execution and register windows for out-of-order processors

    ISA extensions are a powerful approach to implementing new hardware techniques that require or benefit from compiler support: decisions made at compile time can be complemented at runtime, achieving a synergistic effect between the compiler and the processor. This thesis focuses on two ISA extensions: predicated execution and register windows. Predicated execution is exploited by the if-conversion compiler technique, which removes control dependences by transforming them into data dependences, helping to exploit ILP beyond a single basic block. Register windows reduce the number of loads and stores required to save and restore registers across procedure calls by storing multiple contexts in a large architectural register file.

    In-order processors especially benefit from both ISA extensions to overcome the limitations that control dependences and the memory hierarchy impose on static scheduling: predicated execution allows control-dependent instructions to move past branches, and register windows reduce the number of memory operations across procedure calls. Although if-conversion and register windows were not developed exclusively for in-order processors, their use in out-of-order processors has received very little study. This thesis shows that if-conversion and register windows introduce both new performance opportunities and new challenges in out-of-order processors.

    In out-of-order processors, if-conversion helps eliminate hard-to-predict branches, alleviating the severe performance penalties caused by branch mispredictions. However, removing some conditional branches through if-conversion may hurt the predictability of the remaining branches, because it can reduce the amount of correlation information available to the branch predictor. Moreover, predicated execution in out-of-order processors must deal with two further performance issues. First, multiple definitions of the same logical register can be merged into a single control flow, where each definition is guarded by a different predicate. Second, instructions whose guarding predicate evaluates to false consume resources unnecessarily. This thesis proposes a branch prediction scheme based on predicate prediction that solves all three problems. The scheme, built on top of a predicated ISA that implements a compare-and-branch model such as the one considered in this thesis, has two advantages. First, branch accuracy improves because the correlation information is not lost after if-conversion, and the proposed mechanism can use the computed value of the branch predicate when it is available, achieving 100% accuracy in those cases. Second, it avoids the out-of-order execution problems of predication.

    Regarding register windows, we propose a mechanism that reduces the physical register requirements of an out-of-order processor to the bare minimum with almost no performance loss. The mechanism identifies which architectural registers are in use by current in-flight instructions; registers not in use, i.e. referenced by no in-flight instruction, can be released early.

    In summary, this thesis proposes very efficient, low-cost hardware implementations of predicated execution and register windows that provide important benefits to out-of-order processors.
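
As a concrete illustration of the if-conversion transformation described above, here is a minimal C sketch (illustrative only; the thesis targets a predicated ISA with a compare-and-branch model, not C): a hard-to-predict branch becomes a data dependence, and the two guarded definitions of a register are merged by a select.

```c
/* Before if-conversion: a control dependence the hardware must predict. */
int abs_branchy(int x) {
    if (x < 0)          /* hard-to-predict conditional branch */
        x = -x;
    return x;
}

/* After if-conversion: the condition becomes a data dependence.
 * A predicated ISA would guard the negate with a predicate register;
 * here the C ternary stands in for that select, which also merges the
 * two definitions of the same logical register. */
int abs_ifconverted(int x) {
    int p   = (x < 0);      /* predicate definition (compare) */
    int neg = -x;           /* executes unconditionally */
    return p ? neg : x;     /* select picks the live definition */
}
```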

    Fluid Stochastic Petri Nets: From Fluid Atoms in ILP Processor Pipelines to Fluid Atoms in P2P Streaming Networks

    © 2012 Mitrevski and Kotevski, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    Methods and systems for ordering instructions using future values

    A method of ordering instructions. The method can include placing a first instruction that consumes the value of an object before a second instruction that produces that value, such that the first instruction is processed before the second instruction and a physical location is allocated to the value of the object upon processing the first instruction.
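
A hypothetical sketch of this idea, assuming a rename-table-like structure in which the first instruction to reference a logical register, even a consumer appearing before its producer, allocates the physical location. All names and structures below are illustrative, not taken from the patent text.

```c
#include <stdio.h>

typedef struct {
    int value;
    int ready;   /* set once the producer has written the value */
} PhysReg;

#define NUM_PREGS 8
static PhysReg pregs[NUM_PREGS];
static int     next_free = 0;        /* no free-list handling in this sketch */
static int     rename_map[16];       /* logical -> physical, -1 if unmapped */

/* First reference (producer OR consumer) allocates the location. */
static int lookup_or_allocate(int logical) {
    if (rename_map[logical] < 0) {
        rename_map[logical] = next_free++;
        pregs[rename_map[logical]].ready = 0;
    }
    return rename_map[logical];
}

int main(void) {
    for (int i = 0; i < 16; i++) rename_map[i] = -1;

    /* Consumer processed first: it names logical r3 and thereby
     * allocates the physical location it will later read. */
    int p_consumer = lookup_or_allocate(3);

    /* Producer processed second: it maps to the SAME location and
     * fills in the "future value" the earlier consumer is waiting on. */
    int p_producer = lookup_or_allocate(3);
    pregs[p_producer].value = 42;
    pregs[p_producer].ready = 1;

    if (pregs[p_consumer].ready)
        printf("consumer reads %d\n", pregs[p_consumer].value);
    return 0;
}
```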

    Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era

    Around 2003, newly binding power constraints caused single-thread performance growth to slow dramatically. The multi-core era was born, with an emphasis on explicitly parallel software. Continuing to grow single-thread performance remains important in the multi-core context, but it must be done in an energy-efficient way. One significant impediment to performance growth in both out-of-order and in-order processors is the long latency of last-level cache misses. Prior work introduced the idea of load latency tolerance: the ability to dynamically remove miss-dependent instructions from critical execution structures, continue execution under the miss, and re-execute the miss-dependent instructions after the miss returns. However, previously proposed designs were unable to improve performance in an energy-efficient way; they introduced too many new large, complex structures and re-executed too many instructions. This dissertation describes a new load-latency-tolerant design that is both energy-efficient and applicable to both in-order and out-of-order cores. Key novel features include the formulation of slice re-execution as an alternative use of multithreading support, efficient schemes for register and memory state management, and new pruning mechanisms that drastically reduce load latency tolerance's dynamic execution overheads. Area analysis shows that energy-efficient load latency tolerance increases the footprint of an out-of-order core by only a few percent, while cycle-level simulation shows that it significantly improves the performance of memory-bound programs. Energy-efficient load latency tolerance is more energy-efficient than, and synergistic with, existing performance techniques like dynamic voltage and frequency scaling (DVFS).
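
The core mechanism, deferring the miss-dependent slice and re-executing it later, can be sketched as follows. This is a minimal illustration under assumed simplifications (a single outstanding miss, a flat slice buffer), not the dissertation's actual design.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct { int id; bool depends_on_miss; } Insn;

#define SLICE_MAX 64
static Insn slice_buffer[SLICE_MAX];
static int  slice_len = 0;

static void execute(Insn in) { printf("execute insn %d\n", in.id); }

static void issue(Insn in, bool miss_outstanding) {
    if (miss_outstanding && in.depends_on_miss) {
        /* Defer: remove from critical structures, park in the slice. */
        slice_buffer[slice_len++] = in;
    } else {
        execute(in);   /* independent work continues under the miss */
    }
}

static void miss_returns(void) {
    /* Re-execute the deferred slice now that the load value is back. */
    for (int i = 0; i < slice_len; i++) execute(slice_buffer[i]);
    slice_len = 0;
}

int main(void) {
    Insn a = {1, true}, b = {2, false}, c = {3, true};
    issue(a, true);   /* deferred: depends on the miss        */
    issue(b, true);   /* executes immediately under the miss  */
    issue(c, true);   /* deferred                             */
    miss_returns();   /* slice re-execution                   */
    return 0;
}
```

The pruning mechanisms the abstract mentions would shrink what lands in the slice buffer; the sketch above defers every dependent instruction for clarity.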

    Doctor of Philosophy

    With the explosion of chip transistor counts, the semiconductor industry has struggled to continue scaling computing performance in line with historical trends. In recent years, the de facto way to use the excess transistors has been to increase the size of the on-chip data cache, allowing fast access to a larger portion of main memory. These large caches allowed the continued scaling of single-thread performance, which had not yet reached the limits of instruction-level parallelism (ILP). As we approach the potential limits of parallelism within a single-threaded application, new approaches such as chip multiprocessors (CMPs) have become popular for scaling performance using thread-level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where both single-threaded and multithreaded performance have often been ignored by computer architects. We propose that novel hardware/OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intensive workloads, and that this OS contribution grows to a significant fraction of total instructions when executing several common datacenter applications. We demonstrate that architectural improvements have had little to no effect on OS performance over the last 15 years, leaving ample room for improvement. We specifically consider three potential solutions to improve OS execution on modern processors. First, a separate operating system processor (OSP) operating concurrently with general purpose processors (GPPs) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, segregating existing caching structures to decrease cache interference between the OS and applications. Third, refactoring components within the OS itself to be both multithreaded and cache-topology aware, which in turn improves the performance and scalability of many-threaded applications.

    Efficient Predicated Execution Techniques for Reconfigurable Architectures

    Ph.D. dissertation, Seoul National University, Department of Electrical Engineering and Computer Science, August 2013 (advisor: Kiyoung Choi).
    Coarse-Grained Reconfigurable Architecture (CGRA) is one of the viable solutions for accelerating data-intensive applications in embedded systems. A CGRA typically consists of an array of processing elements (PEs) and a centralized controller, a combination that can provide high performance, flexibility, and low power: parallel processing across the PE array reduces application execution time, the reconfigurability of the PEs allows their functionality to change across applications, and a simplified control structure with static scheduling of instruction fetch and data communication minimizes power consumption. However, as applications grow complex and their data-intensive parts acquire control flow, CGRAs face a challenge to their effectiveness: because all PEs are driven by a single centralized unit, programs with control divergence among the PEs cannot be executed.
    To overcome this problem, we can adopt the technique called predicated execution, the only solution known so far, but conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction-fetch, decode, and nullify steps. This thesis therefore examines the performance and power issues, not previously well addressed, that predicated execution raises when a CGRA executes applications that are both data- and control-intensive, and proposes high-performance, low-power predication mechanisms. Experiments conducted through gate-level simulation show that the proposed mechanism improves energy-delay product by 11.9%, 14.7%, and 23.8% compared to three conventional techniques. The thesis also reveals mapping issues that arise when applications are mapped onto CGRAs using the proposed predication: a power-saving mode introduced into the PEs prohibits multiple conditionals from being parallelized if conventional mapping algorithms are used. A mapping framework is therefore proposed that resolves this problem by assigning conditionals to different PEs. Experiments show that mapping results from the proposed approach achieve 2.21 times higher performance than those of the naïve approach.
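
For context, here is a minimal C sketch of partial predication, one of the conventional predication schemes the thesis compares against: both sides of a conditional execute unconditionally on the PE array, and a select keeps the live result, so the centralized controller never has to diverge. The function and names are illustrative, not taken from the thesis.

```c
/* Partial predication of a per-element conditional: the branch is
 * replaced by unconditional computation of both paths plus a select,
 * so every PE can follow the same centralized instruction stream. */
void saturate(const int *in, int *out, int n, int limit) {
    for (int i = 0; i < n; i++) {
        int p    = (in[i] > limit);    /* predicate computed as data   */
        int clip = limit;              /* "then" path, always executed */
        int keep = in[i];              /* "else" path, always executed */
        out[i]   = p ? clip : keep;    /* select merges the two paths  */
    }
}
```

Full predication schemes instead fetch guarded instructions and nullify those whose predicate is false, which is exactly the fetch/decode/nullify overhead the abstract cites.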

    An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and have thus evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment that enables heterogeneous system design through OpenCL and a customised VLIW chip multiprocessor (CMP) architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses both the software stack and the target hardware architecture in detail and evaluates the scalability of the proposed framework on a cycle-accurate simulator, through the execution of 12 benchmarks across 240 different machine configurations, with further results from an incomplete development branch of the compiler. The benchmarks generally scale well on the LE1 architecture up to eight cores, beyond which the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores), with further improvements from compiler optimisations (x14 for bitonic sort with the same configuration).
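
A rough sketch of the work-item coalescing transformation mentioned above, under the assumption of a 1-D kernel and shown in plain C rather than the framework's actual LLVM passes: the implicit per-work-item invocation is serialized into an explicit loop, so one LE1 core executes many work-items and thread divergence is confined to ordinary control flow inside the loop.

```c
/* Original kernel (OpenCL-style, one work-item per invocation):
 *
 *   __kernel void vadd(__global const float *a, __global const float *b,
 *                      __global float *c) {
 *       size_t i = get_global_id(0);
 *       c[i] = a[i] + b[i];
 *   }
 */

/* Coalesced form emitted for one core: the implicit parallel index
 * becomes an explicit loop over the work-items assigned to this core. */
void vadd_coalesced(const float *a, const float *b, float *c,
                    unsigned first_item, unsigned num_items) {
    for (unsigned i = first_item; i < first_item + num_items; i++)
        c[i] = a[i] + b[i];
}
```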