395 research outputs found

    Time-Shared Execution of Realtime Computer Vision Pipelines by Dynamic Partial Reconfiguration

    Full text link
    This paper presents an FPGA runtime framework that demonstrates the feasibility of using dynamic partial reconfiguration (DPR) for time-sharing an FPGA by multiple realtime computer vision pipelines. The presented time-sharing runtime framework manages an FPGA fabric that can be round-robin time-shared by different pipelines at the time scale of individual frames. In this new use-case, the challenge is to achieve useful performance despite high reconfiguration time. The paper describes the basic runtime support as well as four optimizations necessary to achieve realtime performance given the limitations of DPR on today's FPGAs. The paper provides a characterization of a working runtime framework prototype on a Xilinx ZC706 development board. The paper also reports the performance of realtime computer vision pipelines when time-shared

    TROUTE : a reconfigurability-aware FPGA router

    Get PDF

    High-speed dynamic partial reconfiguration for field programmable gate arrays

    Get PDF
    With dynamically and partially reconfigurable designs, it is necessary that the speed of the reconfiguration be accomplished in a time that is sufficiently small such that the operation of reconfiguration is not the limiting factor in the process. Therefore, the communication between the source of configuration and the configurable unit must be made as fast as possible. The aim of this work is to use an embedded controller internal to the FPGA to control the reconfiguration process and obtain the maximum speed at which reconfiguration can occur, with current FPGA technology. The use of Direct Memory Access (DMA) driven operations instead of the current arbitrated bus architectures yielded a 30% increase in the speed of reconfiguration compared to other methods such as OPB_HWICAP and PLB_HWICAP [1]. The use of interrupt driven partial reconfiguration was also introduced, allowing the processor to switch to other tasks during the reconfiguration operation. All of these contributions lead to significant performance improvements over current partial reconfiguration subsystems. The configuration controller was tested using four partially reconfigurable system implementations: (i) one targeting the Hard IP PowerPC405 on Virtex-4, (ii) a second targeting the Soft IP MicroBlaze on Virtex-5, (iii) a third targeting the Hard IP PowerPC440 on Virtex-5, and (iv) a fourth system targets the Hard IP PowerPC440 on Virtex-5 capable of adaptive feedback. The adaptive feedback Virtex-5 system can use internal voltage and temperature measurements from the Xilinx System Monitor IP to dynamically increase or decrease the speed of reconfiguration and/or change other reconfigurable aspects of the system to better match the environment

    Design of Reconfigurable Crossbar Switch for BiNoC Router

    Get PDF
    this paper presents implementation of 10x10 reconfigurable crossbar switch (RCS) architecture for Dynamic Self-Reconfigurable BiNoC Architecture for Network On Chip. Its main purpose is to increase the performance, flexibility. This paper presents a VHDL based cycle accurate register transfer level model for evaluating the, Power and Area of reconfigurable cross bar switch in BiNoC architectures. We implemented a parameterized register transfer level design of reconfigurable crossbar switch (RCS) architecture. The design is parameterized on (i) size of packets, (ii) length and width of physical links, (iii) number, and depth of arbiters, and (iv) switching technique. The paper discusses in detail the architecture and characterization of the various reconfigurable crossbar switch (RCS) architecture components. The characterized values were integrated into the VHDL based RTL design to build the cycle accurate performance model. In this paper we show the result of simple 10x10 crossbar switch .The results include VHDL simulation of RCS on Xilinx ISE 13.1 software tool

    A Field Programmable Gate Array Architecture for Two-Dimensional Partial Reconfiguration

    Get PDF
    Reconfigurable machines can accelerate many applications by adapting to their needs through hardware reconfiguration. Partial reconfiguration allows the reconfiguration of a portion of a chip while the rest of the chip is busy working on tasks. Operating system models have been proposed for partially reconfigurable machines to handle the scheduling and placement of tasks. They are called OS4RC in this dissertation. The main goal of this research is to address some problems that come from the gap between OS4RC and existing chip architectures and the gap between OS4RC models and practical applications. Some existing OS4RC models are based on an impractical assumption that there is no data exchange channel between IP (Intellectual Property) circuits residing on a Field Programmable Gate Array (FPGA) chip and between an IP circuit and FPGA I/O pins. For models that do not have such an assumption, their inter-IP communication channels have severe drawbacks. Those channels do not work well with 2-D partial reconfiguration. They are not suitable for intensive data stream processing. And frequently they are very complicated to design and very expensive. To address these problems, a new chip architecture that can better support inter-IP and IP-I/O communication is proposed and a corresponding OS4RC kernel is then specified. The proposed FPGA architecture is based on an array of clusters of configurable logic blocks, with each cluster serving as a partial reconfiguration unit, and a mesh of segmented buses that provides inter-IP and IP-I/O communication channels. The proposed OS4RC kernel takes care of the scheduling, placement, and routing of circuits under the constraints of the proposed architecture. Features of the new architecture in turns reduce the kernel execution times and enable the runtime scheduling, placement and routing. The area cost and the configuration memory size of the new chip architecture are calculated and analyzed. And the efficiency of the OS4RC kernel is evaluated via simulation using three different task models

    Bridging the Gap between Relocatability and Available Technology: The Erlangen Slot Machine

    Get PDF
    We present an FPGA-based reconfigurable platform called Erlangen Slot Machine (ESM). The main advantages of this platform are: First, the possibility for each module to access peripherals independent from its location through a programmable crossbar, and local SRAM banks for individual modules. This physical design eases the implementation of run-time reconfigurable partial modules and enables an unrestricted relocation of modules on the device. We present our two-board ESM implementation and demonstrate a partially reconfigurable video filter application as well as a relocatable computer game including a dedicated inter-module communication scheme

    Via-switch FPGA with transistor-free programmability enabling energy-efficient near-memory parallel computation

    Get PDF
    We are developing field-programmable gate arrays (FPGAs) with a new non-volatile switch called via-switch. In via-switch FPGAs (VS-FPGAs), the via-switches required for reconfiguration are placed in the routing layer so that the entire transistor layer can be utilized for computing, and higher implementation density can be achieved compared to conventional SRAM FPGAs. Furthermore, since arithmetic units and memories for computing can be placed under the via-switch crossbar for routing, large-scale parallel operations can be realized where the memory and the arithmetic unit are adjacent to each other. These features enable operation with high energy efficiency. This article reports 65 nm prototype fabrication results and predicted the performance of the VS-FPGA designed for AI applications. We also present the developed application mapping flow and crossbar programming method. The VS-FPGA closes the gap between FPGA and application-specific integrated circuits (ASIC) with the performance advantage of the via-switch and via-switch copy scheme for FPGA-to-ASIC migration, contributing to the expansion of the FPGA usage

    A Dynamically Reconfigurable Parallel Processing Framework with Application to High-Performance Video Processing

    Get PDF
    Digital video processing demands have and will continue to grow at unprecedented rates. Growth comes from ever increasing volume of data, demand for higher resolution, higher frame rates, and the need for high capacity communications. Moreover, economic realities force continued reductions in size, weight and power requirements. The ever-changing needs and complexities associated with effective video processing systems leads to the consideration of dynamically reconfigurable systems. The goal of this dissertation research was to develop and demonstrate the viability of integrated parallel processing system that effectively and efficiently apply pre-optimized hardware cores for processing video streamed data. Digital video is decomposed into packets which are then distributed over a group of parallel video processing cores. Real time processing requires an effective task scheduler that distributes video packets efficiently to any of the reconfigurable distributed processing nodes across the framework, with the nodes running on FPGA reconfigurable logic in an inherently Virtual\u27 mode. The developed framework, coupled with the use of hardware techniques for dynamic processing optimization achieves an optimal cost/power/performance realization for video processing applications. The system is evaluated by testing processor utilization relative to I/O bandwidth and algorithm latency using a separable 2-D FIR filtering system, and a dynamic pixel processor. For these applications, the system can achieve performance of hundreds of 640x480 video frames per second across an eight lane Gen I PCIe bus. Overall, optimal performance is achieved in the sense that video data is processed at the maximum possible rate that can be streamed through the processing cores. This performance, coupled with inherent ability to dynamically add new algorithms to the described dynamically reconfigurable distributed processing framework, creates new opportunities for realizable and economic hardware virtualization.\u2

    FPGA structures for high speed and low overhead dynamic circuit specialization

    Get PDF
    A Field Programmable Gate Array (FPGA) is a programmable digital electronic chip. The FPGA does not come with a predefined function from the manufacturer; instead, the developer has to define its function through implementing a digital circuit on the FPGA resources. The functionality of the FPGA can be reprogrammed as desired and hence the name “field programmable”. FPGAs are useful in small volume digital electronic products as the design of a digital custom chip is expensive. Changing the FPGA (also called configuring it) is done by changing the configuration data (in the form of bitstreams) that defines the FPGA functionality. These bitstreams are stored in a memory of the FPGA called configuration memory. The SRAM cells of LookUp Tables (LUTs), Block Random Access Memories (BRAMs) and DSP blocks together form the configuration memory of an FPGA. The configuration data can be modified according to the user’s needs to implement the user-defined hardware. The simplest way to program the configuration memory is to download the bitstreams using a JTAG interface. However, modern techniques such as Partial Reconfiguration (PR) enable us to configure a part in the configuration memory with partial bitstreams during run-time. The reconfiguration is achieved by swapping in partial bitstreams into the configuration memory via a configuration interface called Internal Configuration Access Port (ICAP). The ICAP is a hardware primitive (macro) present in the FPGA used to access the configuration memory internally by an embedded processor. The reconfiguration technique adds flexibility to use specialized ci rcuits that are more compact and more efficient t han t heir b ulky c ounterparts. An example of such an implementation is the use of specialized multipliers instead of big generic multipliers in an FIR implementation with constant coefficients. To specialize these circuits and reconfigure during the run-time, researchers at the HES group proposed the novel technique called parameterized reconfiguration that can be used to efficiently and automatically implement Dynamic Circuit Specialization (DCS) that is built on top of the Partial Reconfiguration method. It uses the run-time reconfiguration technique that is tailored to implement a parameterized design. An application is said to be parameterized if some of its input values change much less frequently than the rest. These inputs are called parameters. Instead of implementing these parameters as regular inputs, in DCS these inputs are implemented as constants, and the application is optimized for the constants. For every change in parameter values, the design is re-optimized (specialized) during run-time and implemented by reconfiguring the optimized design for a new set of parameters. In DCS, the bitstreams of the parameterized design are expressed as Boolean functions of the parameters. For every infrequent change in parameters, a specialized FPGA configuration is generated by evaluating the corresponding Boolean functions, and the FPGA is reconfigured with the specialized configuration. A detailed study of overheads of DCS and providing suitable solutions with appropriate custom FPGA structures is the primary goal of the dissertation. I also suggest different improvements to the FPGA configuration memory architecture. After offering the custom FPGA structures, I investigated the role of DCS on FPGA overlays and the use of custom FPGA structures that help to reduce the overheads of DCS on FPGA overlays. By doing so, I hope I can convince the developer to use DCS (which now comes with minimal costs) in real-world applications. I start the investigations of overheads of DCS by implementing an adaptive FIR filter (using the DCS technique) on three different Xilinx FPGA platforms: Virtex-II Pro, Virtex-5, and Zynq-SoC. The study of how DCS behaves and what is its overhead in the evolution of the three FPGA platforms is the non-trivial basis to discover the costs of DCS. After that, I propose custom FPGA structures (reconfiguration controllers and reconfiguration drivers) to reduce the main overhead (reconfiguration time) of DCS. These structures not only reduce the reconfiguration time but also help curbing the power hungry part of the DCS system. After these chapters, I study the role of DCS on FPGA overlays. I investigate the effect of the proposed FPGA structures on Virtual-Coarse-Grained Reconfigurable Arrays (VCGRAs). I classify the VCGRA implementations into three types: the conventional VCGRA, partially parameterized VCGRA and fully parameterized VCGRA depending upon the level of parameterization. I have designed two variants of VCGRA grids for HPC image processing applications, namely, the MAC grid and Pixie. Finally, I try to tackle the reconfiguration time overhead at the hardware level of the FPGA by customizing the FPGA configuration memory architecture. In this part of my research, I propose to use a parallel memory structure to improve the reconfiguration time of DCS drastically. However, this improvement comes with a significant overhead of hardware resources which will need to be solved in future research on commercial FPGA configuration memory architectures

    Control Plane Hardware Design for Optical Packet Switched Data Centre Networks

    Get PDF
    Optical packet switching for intra-data centre networks is key to addressing traffic requirements. Photonic integration and wavelength division multiplexing (WDM) can overcome bandwidth limits in switching systems. A promising technology to build a nanosecond-reconfigurable photonic-integrated switch, compatible with WDM, is the semiconductor optical amplifier (SOA). SOAs are typically used as gating elements in a broadcast-and-select (B\&S) configuration, to build an optical crossbar switch. For larger-size switching, a three-stage Clos network, based on crossbar nodes, is a viable architecture. However, the design of the switch control plane, is one of the barriers to packet switching; it should run on packet timescales, which becomes increasingly challenging as line rates get higher. The scheduler, used for the allocation of switch paths, limits control clock speed. To this end, the research contribution was the design of highly parallel hardware schedulers for crossbar and Clos network switches. On a field-programmable gate array (FPGA), the minimum scheduler clock period achieved was 5.0~ns and 5.4~ns, for a 32-port crossbar and Clos switch, respectively. By using parallel path allocation modules, one per Clos node, a minimum clock period of 7.0~ns was achieved, for a 256-port switch. For scheduler application-specific integrated circuit (ASIC) synthesis, this reduces to 2.0~ns; a record result enabling scalable packet switching. Furthermore, the control plane was demonstrated experimentally. Moreover, a cycle-accurate network emulator was developed to evaluate switch performance. Results showed a switch saturation throughput at a traffic load 60\% of capacity, with sub-microsecond packet latency, for a 256-port Clos switch, outperforming state-of-the-art optical packet switches
    corecore