115 research outputs found

    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Get PDF
    Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3- 182times for a Xilinx Virtex5 LX 330T, 1.3-33times for an IBM Cell, and 3-131times for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models

    Accelerating SPICE Model-Evaluation using FPGAs

    Get PDF
    Single-FPGA spatial implementations can provide an order of magnitude speedup over sequential microprocessor implementations for data-parallel, floating-point computation in SPICE model-evaluation. Model-evaluation is a key component of the SPICE circuit simulator and it is characterized by large irregular floating-point compute graphs. We show how to exploit the parallelism available in these graphs on single-FPGA designs with a low-overhead VLIW-scheduled architecture. Our architecture uses spatial floating-point operators coupled to local high-bandwidth memories and interconnected by a time-shared network. We retime operation inputs in the model-evaluation to allow independent scheduling of computation and communication. With this approach, we demonstrate speedups of 2–18× over a dual-core 3GHz Intel Xeon 5160 when using a Xilinx Virtex 5 LX330T for a variety of SPICE device models

    Datapath and memory co-optimization for FPGA-based computation

    No full text
    With the large resource densities available on modern FPGAs it is often the available memory bandwidth that limits the parallelism (and therefore performance) that can be achieved. For this reason the focus of this thesis is the development of an integrated scheduling and memory optimisation methodology to allow high levels of parallelism to be exploited in FPGA based designs. A manual translation from C to hardware is first investigated as a case study, exposing a number of potential optimisation techniques that have not been exploited in existing work. An existing outer loop pipelining approach, originally developed for VLIW processors, is extended and adapted for application to FPGAs. The outer loop pipelining methodology is first developed to use a fixed memory subsystem design and then extended to automate the optimisation of the memory subsystem. This approach allocates arrays to physical memories and selects the set of data reuse structures to implement to match the available and required memory bandwidths as the pipelining search progresses. The final extension to this work is to include the partitioning of data from a single array across multiple physical memories, increasing the number of memory ports through which data my be accessed. The facility for loop unrolling is also added to increase the potential for parallelism and exploit the additional bandwidth that partitioning can provide. We describe our approach based on formal methodologies and present the results achieved when these methods are applied to a number of benchmarks. These results show the advantages of both extending pipelining to levels above the innermost loop and the co-optimisation of the datapath and memory subsystem

    Χρήση μοντέλου παράλληλου προγραμματισμού για σύνθεση αρχιτεκτονικών

    Get PDF
    The problem of automatically generating hardware modules from high level application representations has been at the forefront of EDA research during the last few years. In this Dissertation we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match the applications memory access pattern and computational complexity and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers thereby expanding the scope of FPGAs beyond the realm of hardware design.To πρόβλημα της αυτόματης δημιουργίας μονάδων υλικό από παραστάσεις υψηλού επιπέδου εφαρμογής είναι στην πρώτη γραμμή της EDA έρευνας κατά τη διάρκεια των τελευταίων ετών. Σε αυτή την διατριβή παρουσιάζουμε μια μεθοδολογία για τη αυτόματη σύνθεση επιταχυντές υλικού από εφαρμογές OpenCL. OpenCL είναι ένα πρόσφατο πρότυπο για τη σύνταξη των προγραμμάτων που εκτελούνται σε πλατφόρμες πολλαπλών πυρήνων και επιταχυντές όπως GPUs. Η μεθοδολογία μας μετατρέπει προγράμματα OpenCL σε επιταχυντές υλικού με βάση αρχιτεκτονικά πρότυπα που ρητά αποσυνδέει τους υπολογισμούς από την μεταφορά δεδομένων από/προς την μνήμη όποτε αυτό είναι δυνατό. Τα πρότυπα μπορούν να συντονιστούν ώστε να παρέχουν ένα ευρύ ρεπερτόριο από επιταχυντές που πληρούν τις απαιτήσεις απόδοσης των χρηστών και τα χαρακτηριστικά της συσκευής FPGA. Επιπλέον ένα σύνολο υψηλής και χαμηλής στάθμης βελτιστοποιήσεις μεταγλωττιστή εφαρμόζεται για να παράγει βελτιστοποιημένα επιταχυντές. Η πειραματική αξιολόγηση δείχνει ότι οι επιταχυντές που δημιουργούνται αποτελεσματικά συντονισμένοι για να ταιριάζει με το μοτίβο πρόσβασης στην μνήμη κάθε εφαρμογής και την υπολογιστική πολυπλοκότητα και να επιτύχουν τις απαιτήσεις απόδοσης των χρηστών. Ένας σημαντικός στόχος του εργαλείου μας είναι η επέκταση της βάσης χρηστών πλατφόρμες FPGA για μηχανικούς λογισμικού ώστε να γίνει ανάπτυξη FPGA συστήματα από μηχανικούς λογισμικού χωρίς την ανάγκη για εμπειρία σχεδιασμού υλικού

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Design and application of reconfigurable circuits and systems

    No full text
    Open Acces

    Compiling dataflow graphs into hardware

    Get PDF
    Department Head: L. Darrell Whitley.2005 Fall.Includes bibliographical references (pages 121-126).Conventional computers are programmed by supplying a sequence of instructions that perform the desired task. A reconfigurable processor is "programmed" by specifying the interconnections between hardware components, thereby creating a "hardwired" system to do the particular task. For some applications such as image processing, reconfigurable processors can produce dramatic execution speedups. However, programming a reconfigurable processor is essentially a hardware design discipline, making programming difficult for application programmers who are only familiar with software design techniques. To bridge this gap, a programming language, called SA-C (Single Assignment C, pronounced "sassy"), has been designed for programming reconfigurable processors. The process involves two main steps - first, the SA-C compiler analyzes the input source code and produces a hardware-independent intermediate representation of the program, called a dataflow graph (DFG). Secondly, this DFG is combined with hardware-specific information to create the final configuration. This dissertation describes the design and implementation of a system that performs the DFG to hardware translation. The DFG is broken up into three sections: the data generators, the inner loop body, and the data collectors. The second of these, the inner loop body, is used to create a computational structure that is unique for each program. The other two sections are implemented by using prebuilt modules, parameterized for the particular problem. Finally, a "glue module" is created to connect the various pieces into a complete interconnection specification. The dissertation also explores optimizations that can be applied while processing the DFG, to improve performance. A technique for pipelining the inner loop body is described that uses an estimation tool for the propagation delay of the nodes within the dataflow graph. A scheme is also described that identifies subgraphs with the dataflow graph that can be replaced with lookup tables. The lookup tables provide a faster implementation than random logic in some instances

    The Potential for a GPU-Like Overlay Architecture for FPGAs

    Get PDF
    We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath

    High throughput multidimensional tridiagonal system solvers on FPGAs

    Get PDF
    We present a high performance tridiagonal solver library for Xilinx FPGAs optimized for multiple multi-dimensional systems common in real-world applications. An analytical performance model is developed and used to explore the design space and obtain rapid performance estimates that are over 85% accurate. This library achieves an order of magnitude better performance when solving large batches of systems than previous FPGA work. A detailed comparison with a current state-of-the-art GPU library for multi-dimensional tridiagonal systems on an Nvidia V100 GPU shows the FPGA achieving competitive or better runtime and significant energy savings of over 30%. Through this design, we learn lessons about the types of applications where FPGAs can challenge the current dominance of GPUs

    A Survey and Evaluation of FPGA High-Level Synthesis Tools

    Get PDF
    High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today's system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources
    corecore