50 research outputs found

    Structural optimization of numerical programs for high-level synthesis

    This thesis introduces a new technique, and its associated tool SOAP, to automatically perform source-to-source optimization of numerical programs, specifically targeting the trade-off among numerical accuracy, latency, and resource usage as a high-level synthesis flow for FPGA implementations. A new intermediate representation, MIR, is introduced to carry out the abstraction and optimization of numerical programs. Equivalent structures in MIRs are efficiently discovered using methods based on formal semantics by taking into account axiomatic rules from real arithmetic, such as associativity, distributivity and others, in tandem with program equivalence rules that enable control-flow restructuring and eliminate redundant array accesses. For the first time, we bring rigorous approaches from software static analysis, specifically formal semantics and abstract interpretation, to bear on program transformation for high-level synthesis. New abstract semantics are developed to generate a computable subset of equivalent MIRs from an original MIR. Using formal semantics, three objectives are calculated for each MIR representing a pipelined numerical program: the accuracy of computation, an estimate of FPGA resource utilization, and the latency of program execution. The optimization of these objectives produces a Pareto frontier consisting of a set of equivalent MIRs. We thus go beyond existing literature by not only optimizing the precision requirements of an implementation, but also changing the structure of the implementation itself. Using SOAP to optimize the structure of a variety of real-world and artificially generated arithmetic expressions in single precision, we improve either their accuracy or the resource utilization by up to 60%.
When applied to a suite of computationally intensive numerical programs from the PolyBench and Livermore Loops benchmarks, SOAP has generated circuits that enjoy up to a 12x speedup, with a simultaneous 7x increase in accuracy, at a cost of up to 4x more LUTs.
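The abstract's central observation, that expressions equivalent under real arithmetic can differ in floating-point accuracy and in cost, can be illustrated with a small sketch. This is not SOAP's algorithm or its MIR: it uses double precision rather than single precision, the helper names `exact_error` and `pareto_front` are hypothetical, and the cost figures are invented purely for illustration.

```python
from fractions import Fraction

def exact_error(computed, exact):
    """Absolute error of a float result against an exact rational value."""
    return abs(Fraction(computed) - exact)

def pareto_front(candidates):
    """Keep candidates not dominated in (error, cost); lower is better for both."""
    front = []
    for name, err, cost in candidates:
        dominated = any(
            (e <= err and c <= cost) and (e < err or c < cost)
            for _, e, c in candidates
        )
        if not dominated:
            front.append(name)
    return front

a, b, c = 1e16, -1e16, 1.0
exact = Fraction(a) + Fraction(b) + Fraction(c)  # exact real-arithmetic result

left = (a + b) + c    # the large terms cancel first, so the 1.0 survives
right = a + (b + c)   # the 1.0 is lost to rounding before the cancellation

err_left = exact_error(left, exact)
err_right = exact_error(right, exact)

# Hypothetical cost figures (for example an area estimate), invented for illustration.
candidates = [
    ("(a + b) + c", err_left, 3),
    ("a + (b + c)", err_right, 2),
    ("wasteful", err_right, 5),
]
front = pareto_front(candidates)
```

Under these assumed costs, both associations remain on the frontier because neither is better in both error and cost, while the third candidate is dominated and dropped.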

    Vertical Optimizations of Convolutional Neural Networks for Embedded Systems

    The abstract is in the attachment.

    High-level automation of custom hardware design for high-performance computing

    This dissertation focuses on efficient generation of custom processors from high-level language descriptions. Our work exploits compiler-based optimizations and transformations in tandem with high-level synthesis (HLS) to build high-performance custom processors. The goal is to offer a common multiplatform high-abstraction programming interface for heterogeneous compute systems where the benefits of custom reconfigurable (or fixed) processors can be exploited by the application developers. The research presented in this dissertation supports the following thesis: In an increasingly heterogeneous compute environment it is important to leverage the compute capabilities of each heterogeneous processor efficiently. In the case of FPGA and ASIC accelerators this can be achieved through HLS-based flows that (i) extract parallelism at coarser than basic block granularities, (ii) leverage common high-level parallel programming languages, and (iii) employ high-level source-to-source transformations to generate high-throughput custom processors. First, we propose a novel HLS flow that extracts instruction level parallelism beyond the boundary of basic blocks from C code. Subsequently, we describe FCUDA, an HLS-based framework for mapping fine-grained and coarse-grained parallelism from parallel CUDA kernels onto spatial parallelism. FCUDA provides a common programming model for acceleration on heterogeneous devices (i.e. GPUs and FPGAs). Moreover, the FCUDA framework balances multilevel granularity parallelism synthesis using efficient techniques that leverage fast and accurate estimation models (i.e. do not rely on lengthy physical implementation tools). Finally, we describe an advanced source-to-source transformation framework for throughput-driven parallelism synthesis (TDPS), which appropriately restructures CUDA kernel code to maximize throughput on FPGA devices. 
We have integrated the TDPS framework into the FCUDA flow to enable automatic performance porting of CUDA kernels designed for the GPU architecture onto the FPGA architecture.

    Instruction-set architecture synthesis for VLIW processors


    FASTER: Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration

    The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration) EU FP7 project aims to ease the design and implementation of dynamically changing hardware systems. Our motivation stems from the promise reconfigurable systems hold for achieving high performance and extending product functionality and lifetime via the addition of new features that operate at hardware speed. However, designing a changing hardware system is both challenging and time-consuming. FASTER facilitates the use of reconfigurable technology by providing a complete methodology enabling designers to easily specify, analyze, implement and verify applications on platforms with general-purpose processors and acceleration modules implemented in the latest reconfigurable technology. Our tool-chain supports both coarse- and fine-grain FPGA reconfiguration, while during execution a flexible run-time system manages the reconfigurable resources. We target three applications from different domains. We explore the way each application benefits from reconfiguration, and then we assess them and the FASTER tools in terms of performance, area consumption and accuracy of analysis.

    Towards hardware as a reconfigurable, elastic, and specialized service

    As modern Data Center workloads become increasingly complex, constrained, and critical, mainstream CPU-centric computing has had ever more difficulty in keeping pace. Future data centers are moving towards a more fluid and heterogeneous model, with computation and communication no longer localized to commodity CPUs and routers. Next generation data-centric Data Centers will compute everywhere, whether data is stationary (e.g. in memory) or on the move (e.g. in network). While deploying FPGAs in NICs, as co-processors, in the router, and in Bump-in-the-Wire configurations is a step towards implementing the data-centric model, it is only part of the overall solution. The other part is actually leveraging this reconfigurable hardware. For this to happen, two problems must be addressed: code generation and deployment generation. By code generation we mean transforming abstract representations of an algorithm into equivalent hardware. Deployment generation refers to the runtime support needed to facilitate the execution of this hardware on an FPGA. Efforts at creating supporting tools in these two areas have thus far provided limited benefits. This is because the efforts are limited in one or more of the following ways: They i) do not provide fundamental solutions to a number of challenges, which makes them useful only to a limited group of (mostly) hardware developers, ii) are constrained in their scope, or iii) are ad hoc, i.e., specific to a single usage context, FPGA vendor, or Data Center configuration. Moreover, efforts in these areas have largely been mutually exclusive, which results in incompatibility across development layers; this requires wrappers to be designed to make interfaces compatible. As a result there is significant complexity and effort required to code and deploy efficient custom hardware for FPGAs; effort that may be orders-of-magnitude greater than for analogous software environments.
The goal of this dissertation is to create a framework that enables reconfigurable logic in Data Centers to be targeted with the same level of effort as for a single CPU core. The underlying mechanism for this is a framework, which we refer to as Hardware as a Reconfigurable, Elastic and Specialized Service, or HaaRNESS. In this dissertation, we address two of the core challenges of HaaRNESS: reducing the complexity of code generation by constraining High Level Synthesis (HLS) toolflows, and replacing ad hoc models of deployment generation by generalizing and formalizing what is needed for a hardware Operating System. These parts are unified by the back-end of HLS toolflows, which link generated compute pipelines with the operating system and provide appropriate APIs, wrappers, and software runtimes. The contributions of this dissertation are the following: i) an empirically guided set of systematic transformations for generating high quality HLS code; ii) a framework for instrumenting an HLS compiler to identify and remove optimization blockers; iii) a framework for RTL simulation and IP generation of HLS kernels for rapid turnaround; and iv) a framework for generalization and formalization of hardware operating systems to address the ad hoc nature of existing deployment generation and ensure uniform structure and APIs.

    Datapath and memory co-optimization for FPGA-based computation

    With the large resource densities available on modern FPGAs it is often the available memory bandwidth that limits the parallelism (and therefore performance) that can be achieved. For this reason the focus of this thesis is the development of an integrated scheduling and memory optimisation methodology to allow high levels of parallelism to be exploited in FPGA-based designs. A manual translation from C to hardware is first investigated as a case study, exposing a number of potential optimisation techniques that have not been exploited in existing work. An existing outer loop pipelining approach, originally developed for VLIW processors, is extended and adapted for application to FPGAs. The outer loop pipelining methodology is first developed to use a fixed memory subsystem design and then extended to automate the optimisation of the memory subsystem. This approach allocates arrays to physical memories and selects the set of data reuse structures to implement to match the available and required memory bandwidths as the pipelining search progresses. The final extension to this work is to include the partitioning of data from a single array across multiple physical memories, increasing the number of memory ports through which data may be accessed. The facility for loop unrolling is also added to increase the potential for parallelism and exploit the additional bandwidth that partitioning can provide. We describe our approach based on formal methodologies and present the results achieved when these methods are applied to a number of benchmarks. These results show the advantages of both extending pipelining to levels above the innermost loop and the co-optimisation of the datapath and memory subsystem.
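The partitioning idea in this abstract can be sketched with a simple cyclic scheme, in which element i of an array is stored in bank i mod N. This is a generic illustration, not the thesis's allocation method: the function names are hypothetical, and each bank is assumed to serve one access per cycle.

```python
def cyclic_partition(array, num_banks):
    """Split one array across num_banks memories: element i goes to bank i % num_banks."""
    banks = [[] for _ in range(num_banks)]
    for i, value in enumerate(array):
        banks[i % num_banks].append(value)
    return banks

def locate(index, num_banks):
    """Map an original array index to (bank, offset within that bank)."""
    return index % num_banks, index // num_banks

def conflict_free(indices, num_banks):
    """True if all indices accessed in the same cycle fall in distinct banks."""
    used = [locate(i, num_banks)[0] for i in indices]
    return len(set(used)) == len(used)

data = list(range(8))
banks = cyclic_partition(data, 4)

# A loop unrolled by 4 reads four consecutive elements per cycle.
unrolled_ok = conflict_free([4, 5, 6, 7], 4)
# A stride equal to the bank count sends both reads to the same bank.
strided_ok = conflict_free([0, 4], 4)
```

With four banks, the four consecutive reads of the unrolled loop land in four different banks, whereas the stride-4 pair collides in bank 0. This is why the choice of partitioning has to match the access pattern that unrolling creates.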