202 research outputs found

    Achieving Superscalar Performance without Superscalar Overheads - A Dataflow Compiler IR for Custom Computing

    Get PDF
    The difficulty of effectively parallelizing code for multicore processors, combined with the end of threshold voltage scaling has resulted in the problem of \u27Dark Silicon\u27, severely limiting performance scaling despite Moore\u27s Law. To address dark silicon, not only must we drastically improve the energy efficiency of computation, but due to Amdahl\u27s Law, we must do so without compromising sequential performance. Designers increasingly utilize custom hardware to dramatically improve both efficiency and performance in increasingly heterogeneous architectures. Unfortunately, while it efficiently accelerates numeric, data-parallel applications, custom hardware often exhibits poor performance on sequential code, so complex, power-hungry superscalar processors must still be utilized. This paper addresses the problem of improving sequential performance in custom hardware by (a) switching from a statically scheduled to a dynamically scheduled (dataflow) execution model, and (b) developing a new compiler IR for high-level synthesis that enables aggressive exposition of ILP even in the presence of complex control flow. This new IR is directly implemented as a static dataflow graph in hardware by our high-level synthesis tool-chain, and shows an average speedup of 1.13 times over equivalent hardware generated using LegUp, an existing HLS tool. In addition, our new IR allows us to further trade area & energy for performance, increasing the average speedup to 1.55 times, through loop unrolling, with a peak speedup of 4.05 times. Our custom hardware is able to approach the sequential cycle-counts of an Intel Nehalem Core i7 superscalar processor, while consuming on average only 0.25 times the energy of an in-order Altera Nios IIf processor

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    On Extracting Course-Grained Function Parallelism from C Programs

    Get PDF
    To efficiently utilize the emerging heterogeneous multi-core architecture, it is essential to exploit the inherent coarse-grained parallelism in applications. In addition to data parallelism, applications like telecommunication, multimedia, and gaming can also benefit from the exploitation of coarse-grained function parallelism. To exploit coarse-grained function parallelism, the common wisdom is to rely on programmers to explicitly express the coarse-grained data-flow between coarse-grained functions using data-flow or streaming languages. This research is set to explore another approach to exploiting coarse-grained function parallelism, that is to rely on compiler to extract coarse-grained data-flow from imperative programs. We believe imperative languages and the von Neumann programming model will still be the dominating programming languages programming model in the future. This dissertation discusses the design and implementation of a memory data-flow analysis system which extracts coarse-grained data-flow from C programs. The memory data-flow analysis system partitions a C program into a hierarchy of program regions. It then traverses the program region hierarchy from bottom up, summarizing the exposed memory access patterns for each program region, meanwhile deriving a conservative producer-consumer relations between program regions. An ensuing top-down traversal of the program region hierarchy will refine the producer-consumer relations by pruning spurious relations. We built an in-lining based prototype of the memory data-flow analysis system on top of the IMPACT compiler infrastructure. We applied the prototype to analyze the memory data-flow of several MediaBench programs. The experiment results showed that while the prototype performed reasonably well for the tested programs, the in-lining based implementation may not efficient for larger programs. Also, there is still room in improving the effectiveness of the memory data-flow analysis system. We did root cause analysis for the inaccuracy in the memory data-flow analysis results, which provided us insights on how to improve the memory data-flow analysis system in the future

    Scaling to the end of silicon with EDGE architectures

    Full text link

    An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor

    Get PDF
    Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture, up to eight cores, when the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration

    A general framework to realize an abstract machine as an ILP processor with application to java

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Instruction scheduling in micronet-based asynchronous ILP processors

    Get PDF

    High-level automation of custom hardware design for high-performance computing

    Get PDF
    This dissertation focuses on efficient generation of custom processors from high-level language descriptions. Our work exploits compiler-based optimizations and transformations in tandem with high-level synthesis (HLS) to build high-performance custom processors. The goal is to offer a common multiplatform high-abstraction programming interface for heterogeneous compute systems where the benefits of custom reconfigurable (or fixed) processors can be exploited by the application developers. The research presented in this dissertation supports the following thesis: In an increasingly heterogeneous compute environment it is important to leverage the compute capabilities of each heterogeneous processor efficiently. In the case of FPGA and ASIC accelerators this can be achieved through HLS-based flows that (i) extract parallelism at coarser than basic block granularities, (ii) leverage common high-level parallel programming languages, and (iii) employ high-level source-to-source transformations to generate high-throughput custom processors. First, we propose a novel HLS flow that extracts instruction level parallelism beyond the boundary of basic blocks from C code. Subsequently, we describe FCUDA, an HLS-based framework for mapping fine-grained and coarse-grained parallelism from parallel CUDA kernels onto spatial parallelism. FCUDA provides a common programming model for acceleration on heterogeneous devices (i.e. GPUs and FPGAs). Moreover, the FCUDA framework balances multilevel granularity parallelism synthesis using efficient techniques that leverage fast and accurate estimation models (i.e. do not rely on lengthy physical implementation tools). Finally, we describe an advanced source-to-source transformation framework for throughput-driven parallelism synthesis (TDPS), which appropriately restructures CUDA kernel code to maximize throughput on FPGA devices. We have integrated the TDPS framework into the FCUDA flow to enable automatic performance porting of CUDA kernels designed for the GPU architecture onto the FPGA architecture
    corecore