33 research outputs found

    A Hybrid Partially Reconfigurable Overlay Supporting Just-In-Time Assembly of Custom Accelerators on FPGAs

    Get PDF
    The state of the art in design and development flows for FPGAs are not sufficiently mature to allow programmers to implement their applications through traditional software development flows. The stipulation of synthesis as well as the requirement of background knowledge on the FPGAs\u27 low-level physical hardware structure are major challenges that prevent programmers from using FPGAs. The reconfigurable computing community is seeking solutions to raise the level of design abstraction at which programmers must operate, and move the synthesis process out of the programmers\u27 path through the use of overlays. A recent approach, Just-In-Time Assembly (JITA), was proposed that enables hardware accelerators to be assembled at runtime, all from within a traditional software compilation flow. The JITA approach presents a promising path to constructing hardware designs on FPGAs using pre-synthesized parallel programming patterns, but suffers from two major limitations. First, all variant programming patterns must be pre-synthesized. Second, conditional operations are not supported. In this thesis, I present a new reconfigurable overlay, URUK, that overcomes the two limitations imposed by the JITA approach. Similar to the original JITA approach, the proposed URUK overlay allows hardware accelerators to be constructed on FPGAs through software compilation flows. To this basic capability, URUK adds additional support to enable the assembly of presynthesized fine-grained computational operators to be assembled within the FPGA. This thesis provides analysis of URUK from three different perspectives; utilization, performance, and productivity. The analysis includes comparisons against High-Level Synthesis (HLS) and the state of the art approach to creating static overlays. The tradeoffs conclude that URUK can achieve approximately equivalent performance for algebra operations compared to HLS custom accelerators, which are designed with simple experience on FPGAs. Further, URUK shows a high degree of flexibility for runtime placement and routing of the primitive operations. The analysis shows how this flexibility can be leveraged to reduce communication overhead among tiles, compared to traditional static overlays. The results also show URUK can enable software programmers without any hardware skills to create hardware accelerators at productivity levels consistent with software development and compilation

    Throughput oriented FPGA overlays using DSP blocks

    Get PDF
    Design productivity is a major concern preventing the mainstream adoption of FPGAs. Overlay architectures have emerged as one possible solution to this challenge, offering fast compilation and software-like programmability. However, overlays typically suffer from area and performance overheads due to limited consideration for the underlying FPGA architecture. These overlays have often been of limited size, supporting only relatively small compute kernels. This paper examines the possibility of developing larger, more efficient, overlays using multiple DSP blocks and then maximising utilisation by mapping multiple instances of kernels simultaneously onto the overlay to exploit kernel level parallelism. We show a significant improvement in achievable overlay size and overlay utilisation, with a reduction of almost 70% in the overlay tile requirement compared to existing overlay architectures, an operating frequency in excess of 300 MHz, and kernel throughputs of almost 60 GOPS

    Mocarabe: High-Performance Time-Multiplexed Overlays for FPGAs

    Get PDF
    Coarse-grained reconfigurable array (CGRA) overlays can improve dataflow kernel throughput by an order of magnitude over Vivado HLS on Xilinx Alveo U280. This is possible with a combination of carefully floorplanned high-frequency (645 - 768 MHz Torus, 788 - 856 MHz Mesh, 583 - 746 MHz BFT) design and a scalable, communication-aware compiler. Our CGRA architecture supports configurable Processing Element (PE) functionality supported by a configurable number of communication channels to match application demands. Compared to recent FPGA overlays like 4×4 ADRES and HyCUBE implementations in CGRA-ME, our design operates at a faster clock frequency by up to 3.4×, while scaling to an orders-of-magnitude larger array size of 19×69 on Xilinx Alveo U280. We propose a novel topology agnostic ILP placer that formulates the CGRA placement problem into an ILP problem. Our ILP placer optimizes placement regardless of topology and even for non-linear objective functions by using pre-computed placement costs as inputs to the ILP problem formulation. Using the ILP placer reduces placement quadratic wirelength up to 37% compared to the commonly used simulated annealing approach but increases runtime from less than a minute to hours. Our communication-aware compiler targets HLS objectives such as initiation interval (II) and minimizes communication cost using an integer linear programming (ILP) formulation. Unlike SDC schedulers in FPGA HLS tools, we treat data movement as a first-class citizen by encoding the space and time resources of the communication network in the ILP formulation. Given the same constraints on operational resources as Vivado HLS, we can retain our target II and achieve up to 9.2× higher frequency. We compare Torus and Mesh topologies, and show Mesh has less latency per area compared to Torus for the same benchmarks

    Design and Programming Methods for Reconfigurable Multi-Core Architectures using a Network-on-Chip-Centric Approach

    Get PDF
    A current trend in the semiconductor industry is the use of Multi-Processor Systems-on-Chip (MPSoCs) for a wide variety of applications such as image processing, automotive, multimedia, and robotic systems. Most applications gain performance advantages by executing parallel tasks on multiple processors due to the inherent parallelism. Moreover, heterogeneous structures provide high performance/energy efficiency, since application-specific processing elements (PEs) can be exploited. The increasing number of heterogeneous PEs leads to challenging communication requirements. To overcome this challenge, Networks-on-Chip (NoCs) have emerged as scalable on-chip interconnect. Nevertheless, NoCs have to deal with many design parameters such as virtual channels, routing algorithms and buffering techniques to fulfill the system requirements. This thesis highly contributes to the state-of-the-art of FPGA-based MPSoCs and NoCs. In the following, the three major contributions are introduced. As a first major contribution, a novel router concept is presented that efficiently utilizes communication times by performing sequences of arithmetic operations on the data that is transferred. The internal input buffers of the routers are exchanged with processing units that are capable of executing operations. Two different architectures of such processing units are presented. The first architecture provides multiply and accumulate operations which are often used in signal processing applications. The second architecture introduced as Application-Specific Instruction Set Routers (ASIRs) contains a processing unit capable of executing any operation and hence, it is not limited to multiply and accumulate operations. An internal processing core located in ASIRs can be developed in C/C++ using high-level synthesis. The second major contribution comprises application and performance explorations of the novel router concept. Models that approximate the achievable speedup and the end-to-end latency of ASIRs are derived and discussed to show the benefits in terms of performance. Furthermore, two applications using an ASIR-based MPSoC are implemented and evaluated on a Xilinx Zynq SoC. The first application is an image processing algorithm consisting of a Sobel filter, an RGB-to-Grayscale conversion, and a threshold operation. The second application is a system that helps visually impaired people by navigating them through unknown indoor environments. A Light Detection and Ranging (LIDAR) sensor scans the environment, while Inertial Measurement Units (IMUs) measure the orientation of the user to generate an audio signal that makes the distance as well as the orientation of obstacles audible. This application consists of multiple parallel tasks that are mapped to an ASIR-based MPSoC. Both applications show the performance advantages of ASIRs compared to a conventional NoC-based MPSoC. Furthermore, dynamic partial reconfiguration in terms of relocation and security aspects are investigated. The third major contribution refers to development and programming methodologies of NoC-based MPSoCs. A software-defined approach is presented that combines the design and programming of heterogeneous MPSoCs. In addition, a Kahn-Process-Network (KPN) –based model is designed to describe parallel applications for MPSoCs using ASIRs. The KPN-based model is extended to support not only the mapping of tasks to NoC-based MPSoCs but also the mapping to ASIR-based MPSoCs. A static mapping methodology is presented that assigns tasks to ASIRs and processors for a given KPN-model. The impact of external hardware components such as sensors, actuators and accelerators connected to the processors is also discussed which makes the approach of high interest for embedded systems

    Software instruction caching

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Includes bibliographical references (p. 185-193).As microprocessor complexities and costs skyrocket, designers are looking for ways to simplify their designs to reduce costs, improve energy efficiency, or squeeze more computational elements on each chip. This is particularly true for the embedded domain where cost and energy consumption are paramount. Software instruction caches have the potential to provide the required performance while using simpler, more efficient hardware. A software cache consists of a simple array memory (such as a scratchpad) and a software system that is capable of automatically managing that memory as a cache. Software caches have several advantages over traditional hardware caches. Without complex cache-management logic, the processor hardware is cheaper and easier to design, verify and manufacture. The reduced access energy of simple memories can result in a net energy savings if management overhead is kept low. Software caches can also be customized to each individual program's needs, improving performance or eliminating unpredictable timing for real-time embedded applications. The greatest challenge for a software cache is providing good performance using general-purpose instructions for cache management rather than specially-designed hardware. This thesis designs and implements a working system (Flexicache) on an actual embedded processor and uses it to investigate the strengths and weaknesses of software instruction caches. Although both data and instruction caches can be implemented in software, very different techniques are used to optimize performance; this work focuses exclusively on software instruction caches. The Flexicache system consists of two software components: a static off-line preprocessor to add caching to an application and a dynamic runtime system to manage memory during execution. Key interfaces and optimizations are identified and characterized. The system is evaluated in detail from the standpoints of both performance and energy consumption. The results indicate that software instruction caches can perform comparably to hardware caches in embedded processors. On most benchmarks, the overhead relative to a hardware cache is less than 12% and can be as low as 2.4%. At the same time, the software cache uses up to 6% less energy. This is achieved using a simple, directly-addressed memory and without requiring any complex, specialized hardware structures.by Jason Eric Miller.Ph.D

    Fault and Defect Tolerant Computer Architectures: Reliable Computing With Unreliable Devices

    Get PDF
    This research addresses design of a reliable computer from unreliable device technologies. A system architecture is developed for a fault and defect tolerant (FDT) computer. Trade-offs between different techniques are studied and yield and hardware cost models are developed. Fault and defect tolerant designs are created for the processor and the cache memory. Simulation results for the content-addressable memory (CAM)-based cache show 90% yield with device failure probabilities of 3 x 10(-6), three orders of magnitude better than non fault tolerant caches of the same size. The entire processor achieves 70% yield with device failure probabilities exceeding 10(-6). The required hardware redundancy is approximately 15 times that of a non-fault tolerant design. While larger than current FT designs, this architecture allows the use of devices much more likely to fail than silicon CMOS. As part of model development, an improved model is derived for NAND Multiplexing. The model is the first accurate model for small and medium amounts of redundancy. Previous models are extended to account for dependence between the inputs and produce more accurate results

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    Reconfigurable microarchitectures at the programmable logic interface

    Get PDF
    corecore