96 research outputs found

    FPGA structures for high speed and low overhead dynamic circuit specialization

    A Field Programmable Gate Array (FPGA) is a programmable digital electronic chip. The FPGA does not come with a predefined function from the manufacturer; instead, the developer has to define its function by implementing a digital circuit on the FPGA resources. The functionality of the FPGA can be reprogrammed as desired, hence the name "field programmable". FPGAs are useful in small-volume digital electronic products, as the design of a custom digital chip is expensive. Changing the FPGA functionality (also called configuring it) is done by changing the configuration data (in the form of bitstreams) that defines the FPGA functionality. These bitstreams are stored in a memory of the FPGA called the configuration memory. The SRAM cells of LookUp Tables (LUTs), Block Random Access Memories (BRAMs) and DSP blocks together form the configuration memory of an FPGA. The configuration data can be modified according to the user's needs to implement user-defined hardware.

    The simplest way to program the configuration memory is to download the bitstreams using a JTAG interface. However, modern techniques such as Partial Reconfiguration (PR) enable us to configure part of the configuration memory with partial bitstreams during run-time. The reconfiguration is achieved by swapping partial bitstreams into the configuration memory via a configuration interface called the Internal Configuration Access Port (ICAP). The ICAP is a hardware primitive (macro) present in the FPGA that allows an embedded processor to access the configuration memory internally. The reconfiguration technique adds the flexibility to use specialized circuits that are more compact and more efficient than their bulky counterparts. An example of such an implementation is the use of specialized multipliers instead of big generic multipliers in an FIR implementation with constant coefficients.

    To specialize these circuits and reconfigure them during run-time, researchers at the HES group proposed a novel technique called parameterized reconfiguration that can be used to efficiently and automatically implement Dynamic Circuit Specialization (DCS), which is built on top of the Partial Reconfiguration method. It uses a run-time reconfiguration technique that is tailored to implement a parameterized design. An application is said to be parameterized if some of its input values change much less frequently than the rest. These inputs are called parameters. Instead of implementing these parameters as regular inputs, in DCS they are implemented as constants, and the application is optimized for those constants. For every change in parameter values, the design is re-optimized (specialized) during run-time and implemented by reconfiguring the optimized design for the new set of parameters. In DCS, the bitstreams of the parameterized design are expressed as Boolean functions of the parameters. For every infrequent change in parameters, a specialized FPGA configuration is generated by evaluating the corresponding Boolean functions, and the FPGA is reconfigured with the specialized configuration.

    A detailed study of the overheads of DCS, and the provision of suitable solutions through appropriate custom FPGA structures, is the primary goal of this dissertation. I also suggest different improvements to the FPGA configuration memory architecture. After proposing the custom FPGA structures, I investigate the role of DCS on FPGA overlays and the use of custom FPGA structures that help to reduce the overheads of DCS on FPGA overlays.
    By doing so, I hope to convince developers to use DCS (which now comes with minimal costs) in real-world applications. I start the investigation of the overheads of DCS by implementing an adaptive FIR filter (using the DCS technique) on three different Xilinx FPGA platforms: Virtex-II Pro, Virtex-5, and Zynq-SoC. The study of how DCS behaves, and what its overhead is, across the evolution of these three FPGA platforms forms the non-trivial basis for discovering the costs of DCS. After that, I propose custom FPGA structures (reconfiguration controllers and reconfiguration drivers) to reduce the main overhead of DCS, the reconfiguration time. These structures not only reduce the reconfiguration time but also help curb the power-hungry part of the DCS system. After these chapters, I study the role of DCS on FPGA overlays. I investigate the effect of the proposed FPGA structures on Virtual Coarse-Grained Reconfigurable Arrays (VCGRAs). I classify the VCGRA implementations into three types depending on the level of parameterization: the conventional VCGRA, the partially parameterized VCGRA, and the fully parameterized VCGRA. I have designed two variants of VCGRA grids for HPC image processing applications, namely the MAC grid and Pixie. Finally, I try to tackle the reconfiguration time overhead at the hardware level of the FPGA by customizing the FPGA configuration memory architecture. In this part of my research, I propose to use a parallel memory structure to drastically improve the reconfiguration time of DCS. However, this improvement comes with a significant overhead in hardware resources, which will need to be addressed in future research on commercial FPGA configuration memory architectures.
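    The abstract above describes DCS as expressing the bitstream bits of a parameterized design as Boolean functions of the parameters, which are re-evaluated at run-time to produce a specialized partial configuration. The Python sketch below illustrates that idea only; the 4-bit LUT model, the coefficient encoding and the helper names are assumptions made for illustration, not the HES group's actual tool flow.

```python
# Illustrative model of Dynamic Circuit Specialization (DCS) evaluation.
# Not the authors' tool flow: the 4-bit LUT model, the parameter encoding
# and the helper names below are assumptions made purely for illustration.

def lut_truth_table_for_const_mult(coeff: int, bits: int = 4):
    """Truth table of a small LUT that multiplies a `bits`-wide input by a
    run-time constant `coeff` (keeping only the low `bits` bits). In DCS
    terms, the LUT contents are a Boolean function of the parameter `coeff`,
    re-evaluated whenever the parameter changes."""
    mask = (1 << bits) - 1
    return [(x * coeff) & mask for x in range(1 << bits)]

def specialize(coefficients):
    """Evaluate the parameterized configuration for a new parameter set,
    producing the specialized 'partial bitstream' (here: a list of LUT
    truth tables) that would be written through ICAP on a real device."""
    return [lut_truth_table_for_const_mult(c) for c in coefficients]

# An infrequent parameter change triggers re-specialization: only the
# truth tables that actually changed need to be reconfigured.
old_cfg = specialize([3, 5, 7])   # FIR coefficients acting as parameters
new_cfg = specialize([2, 5, 7])   # coefficient 0 changed
changed = [i for i, (a, b) in enumerate(zip(old_cfg, new_cfg)) if a != b]
print("LUTs to reconfigure:", changed)   # -> [0]
```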

    High throughput spatial convolution filters on FPGAs

    Digital signal processing (DSP) on field-programmable gate arrays (FPGAs) has long been appealing because of the inherent parallelism in these computations, which can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, including additional hard blocks such as the DSP blocks found in modern FPGAs. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study on spatial convolutional filter implementations on FPGAs, optimizing around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for large 4K-resolution image frames at frame rates of 30–60 FPS, while maintaining functional flexibility.
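    The abstract describes mapping 2-D (spatial) convolution onto DSP blocks that are primarily designed for 1-D filters. A common way to reconcile the two is to decompose the k×k filter into k 1-D row filters fed by line buffers, one MAC cascade per kernel row. The behavioural Python model below illustrates that general decomposition only; it is an assumption about the technique, not the architecture presented in the paper.

```python
# Behavioural sketch of a spatial (2-D) convolution decomposed into k
# 1-D row filters, mirroring how line buffers can feed one DSP MAC cascade
# per kernel row. A software model for illustration, not the paper's RTL.
# (Kernel flipping is omitted, as is common for image filters.)
import numpy as np

def conv2d_as_row_filters(frame: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    k = kernel.shape[0]                      # assume a square k x k kernel
    out_h, out_w = frame.shape[0] - k + 1, frame.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for r in range(k):                       # one 1-D filter per kernel row
        row_taps = kernel[r]
        for y in range(out_h):
            line = frame[y + r]              # the 'line buffer' for row y + r
            for x in range(out_w):
                out[y, x] += np.dot(row_taps, line[x:x + k])
    return out

frame = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0               # run-time reloadable coefficients
reference = np.array([[np.sum(frame[y:y + 3, x:x + 3] * kernel)
                       for x in range(4)] for y in range(4)])
assert np.allclose(conv2d_as_row_filters(frame, kernel), reference)
```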

    Methods for FPGA pre-processing of data for the ALOFT readout system

    In 2017 the Airborne Lightning Observatory for FEGS & TGFs (ALOFT) campaign was completed with the goal of studying thundercloud-related high-energy phenomena, namely Terrestrial Gamma-Ray Flashes (TGFs) and gamma-ray glows, and their connections. This was done on a high-altitude airplane flying over the thunderclouds. The campaign observed two gamma-ray glows but no TGFs. A new ALOFT campaign has been confirmed for 2023 and will contain several upgrades and improvements. Some of these improvements include developing two new detectors that will be added to the instrument, as well as a new readout system based around a ZYNQ-7000 series System on Chip (SoC). Through this thesis, my contribution to this development has been twofold: 1. Verify that the integrated ZYNQ Serial Peripheral Interface (SPI) controller can be used to interface with and configure the Analog to Digital Converters (ADCs) in the new detectors. This involves a) making a converter between the four-wire SPI used by the ZYNQ and the three-wire SPI used by the ADCs, b) modelling the behaviour of the ADCs' control register communication, and c) verifying whether the ZYNQ SPI controller can transmit the protocols required to configure the ADCs and whether it can access the FPGA part of the ZYNQ SoC. 2. During TGFs the data output of the new detectors far exceeds the ability of the system to send all the raw data to storage, requiring the data to be temporarily buffered. The buffer capacity needed to guarantee that no data is lost would consume 73% of the total shared buffer capacity available to the FPGA part of the system. This leaves very little capacity for the remaining detectors and any FPGA modules. In this thesis, different approaches to reducing the buffer requirements have been explored. My work in this thesis shows that: 1. The ZYNQ SPI controller can be used to configure the ADCs, by a) showing that the required converter can easily be made in the FPGA part of the ZYNQ, b) showing that a model of how the ADCs are configured can be created, and c) testing that the ZYNQ SPI controller is compatible with the protocols used to interact with and configure the ADCs. 2. After exploring both real-time analysis of the data on the SoC and compression of the raw data, the safest method to use is compression. It is also shown that, with the compression methods explored and considered, the buffer requirement can be reduced from 73% to less than 30% of the total shared capacity.
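    The buffering discussion above (73% of the shared buffer without compression versus under 30% with it) boils down to a burst-rate versus drain-rate calculation. The sketch below shows the form of that calculation with entirely made-up numbers; none of the rates, burst lengths or capacities are taken from the ALOFT system.

```python
# Back-of-the-envelope sketch of burst buffering with and without
# compression. Every number below is an illustrative assumption, not an
# actual ALOFT detector rate, burst length or buffer size.

def buffer_needed_bytes(burst_mbps, drain_mbps, burst_ms, compression=1.0):
    """Bytes that must be buffered to lose no data during a burst: data
    arrives at burst_mbps (reduced by `compression`) while the readout can
    only drain drain_mbps, for burst_ms milliseconds."""
    incoming = burst_mbps / compression            # effective Mbit/s in
    excess = max(incoming - drain_mbps, 0.0)       # Mbit/s piling up
    return excess * burst_ms * 1e3 / 8             # Mbit/s * ms -> bytes

shared_buffer = 512 * 1024                         # assumed shared capacity
raw = buffer_needed_bytes(200.0, 20.0, 10.0)                      # no compression
packed = buffer_needed_bytes(200.0, 20.0, 10.0, compression=3.0)  # ~3:1
print(f"raw:        {raw / shared_buffer:6.1%} of the shared buffer")
print(f"compressed: {packed / shared_buffer:6.1%} of the shared buffer")
```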

    A hardware/application overlay model for large-scale neuromorphic simulation

    Neuromorphic computing is gaining momentum as an alternative hardware platform for large-scale neural simulation. However, with several major devices and systems available and planned, often with very different characteristics, it is not always clear which platform is suitable for which application. Simulating the platform on conventional computers is typically too slow to be of use, but an alternative approach is to implement an 'emulation' of the hardware in FPGAs, which can execute at near-hardware speeds but does not commit to a specific hardware architecture. We present an overlay model - a method which superimposes bespoke features on top of a standard template - in both hardware and software to implement neuromorphic architectures using the POETS (Partially Ordered Event Triggered Systems) system. This combination of overlays permits very large-scale simulations to be performed in real time for hardware exploration or application verification, while retaining the flexibility to redefine either the hardware or the software layer if results indicate potential to improve performance or significant design problems. Using this system we simulate up to 500,000 neurons on a single-box system, which can be scaled to ~4,000,000 neurons in an 8-box configuration. Results indicate the crucial constraint for real-time simulation, the peak input spike rate per neuron, and help to optimise both hardware and software around neural application requirements. The preliminary architecture demonstrates the feasibility of an overlay model, while indicating directions for future neuromorphic systems. With POETS, we introduce a platform that can help to shape and investigate the neuromorphic architectures of the future.
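    The constraint highlighted above, the peak input spike rate per neuron, can be read as a simple event-throughput budget: real time holds only if the platform can deliver all input spike events as fast as the model generates them at its peak. The sketch below illustrates that budget with assumed figures; the neuron count echoes the abstract, but the spike rate and delivery capacity are illustrative, not measured POETS numbers.

```python
# Illustrative real-time feasibility check built around the constraint the
# abstract highlights: the peak input spike rate per neuron. The spike rate
# and event-delivery capacity are assumptions, not measured POETS figures.

def realtime_feasible(neurons, peak_input_spike_hz, delivery_capacity_hz):
    """Real time holds if every input spike event can be delivered within
    the same wall-clock interval in which the model generates it."""
    peak_event_load = neurons * peak_input_spike_hz   # events per second
    return peak_event_load <= delivery_capacity_hz, peak_event_load

ok, load = realtime_feasible(neurons=500_000,          # single-box figure
                             peak_input_spike_hz=100,  # assumed peak fan-in
                             delivery_capacity_hz=1e8) # assumed budget
print(f"peak event load: {load:.2e} events/s, real-time feasible: {ok}")
```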

    Methodology for complex dataflow application development

    This thesis addresses problems inherent to the development of complex applications for reconfigurable systems. Many projects fail to complete, or take much longer than originally estimated, by relying on traditional iterative software development processes typically used with conventional computers. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies considering the specific properties of reconfigurable systems do not exist. The first contribution of this thesis is a design methodology to facilitate systematic development of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identified bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications, and both achieve state-of-the-art performance. The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed. An algorithm to automatically map logical memories to heterogeneous physical memories with special attention to die boundaries is proposed. As a result, only the proposed algorithm managed to successfully place and route all designs used in the evaluation, while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms. The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain. A Monte-Carlo-based simulation of dose accumulation in human tissue is accelerated using the proposed methodology to meet the real-time requirements of adaptive radiotherapy.
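    The multi-die memory-mapping step described above (assigning logical memories to heterogeneous physical memories with special attention to die boundaries) can be pictured as a constrained bin-packing problem. The sketch below shows a simple largest-first, best-fit greedy assignment as one plausible shape of such an algorithm; it is an illustrative stand-in, not the algorithm proposed in the thesis, and all names and sizes are invented.

```python
# Illustrative greedy mapping of logical memories onto per-die physical
# memory, keeping every logical memory within a single die so no net has
# to cross a die boundary. A stand-in sketch, not the thesis' algorithm.

def map_memories(logical, dies):
    """logical: {name: size_kbits}; dies: {die_id: capacity_kbits}.
    Largest-first, best-fit assignment of whole memories to dies."""
    free = dict(dies)
    placement = {}
    for name, size in sorted(logical.items(), key=lambda kv: -kv[1]):
        fits = [(cap - size, die) for die, cap in free.items() if cap >= size]
        if not fits:
            raise ValueError(f"cannot place {name} without splitting it across dies")
        _, die = min(fits)          # tightest remaining fit
        free[die] -= size
        placement[name] = die
    return placement

print(map_memories({"frame_buf": 1024, "fifo_a": 512, "lut_rom": 128},
                   {"die0": 1200, "die1": 900}))
# -> {'frame_buf': 'die0', 'fifo_a': 'die1', 'lut_rom': 'die0'}
```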

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing combined with artificial intelligence has led to new exciting application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to fostering the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.