355 research outputs found

    High throughput spatial convolution filters on FPGAs

    Get PDF
    Digital signal processing (DSP) on field- programmable gate arrays (FPGAs) has long been appealing because of the inherent parallelism in these computations that can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, included additional hard blocks, such as the DSP blocks found in modern FPGAs. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study on spatial convolutional filter implementations on FPGAs, optimizing around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for large 4K resolution image frames at frame rates of 30–60 FPS, while maintaining functional flexibility

    Exploiting All-Programmable System on Chips for Closed-Loop Real-Time Neural Interfaces

    Get PDF
    High-density microelectrode arrays (HDMEAs) feature thousands of recording electrodes in a single chip with an area of few square millimeters. The obtained electrode density is comparable and even higher than the typical density of neuronal cells in cortical cultures. Commercially available HDMEA-based acquisition systems are able to record the neural activity from the whole array at the same time with submillisecond resolution. These devices are a very promising tool and are increasingly used in neuroscience to tackle fundamental questions regarding the complex dynamics of neural networks. Even if electrical or optical stimulation is generally an available feature of such systems, they lack the capability of creating a closed-loop between the biological neural activity and the artificial system. Stimuli are usually sent in an open-loop manner, thus violating the inherent working basis of neural circuits that in nature are constantly reacting to the external environment. This forbids to unravel the real mechanisms behind the behavior of neural networks. The primary objective of this PhD work is to overcome such limitation by creating a fullyreconfigurable processing system capable of providing real-time feedback to the ongoing neural activity recorded with HDMEA platforms. The potentiality of modern heterogeneous FPGAs has been exploited to realize the system. In particular, the Xilinx Zynq All Programmable System on Chip (APSoC) has been used. The device features reconfigurable logic, specialized hardwired blocks, and a dual-core ARM-based processor; the synergy of these components allows to achieve high elaboration performances while maintaining a high level of flexibility and adaptivity. The developed system has been embedded in an acquisition and stimulation setup featuring the following platforms: \u2022 3\ub7Brain BioCam X, a state-of-the-art HDMEA-based acquisition platform capable of recording in parallel from 4096 electrodes at 18 kHz per electrode. \u2022 PlexStim\u2122 Electrical Stimulator System, able to generate electrical stimuli with custom waveforms to 16 different output channels. \u2022 Texas Instruments DLP\uae LightCrafter\u2122 Evaluation Module, capable of projecting 608x684 pixels images with a refresh rate of 60 Hz; it holds the function of optical stimulation. All the features of the system, such as band-pass filtering and spike detection of all the recorded channels, have been validated by means of ex vivo experiments. Very low-latency has been achieved while processing the whole input data stream in real-time. In the case of electrical stimulation the total latency is below 2 ms; when optical stimuli are needed, instead, the total latency is a little higher, being 21 ms in the worst case. The final setup is ready to be used to infer cellular properties by means of closed-loop experiments. As a proof of this concept, it has been successfully used for the clustering and classification of retinal ganglion cells (RGCs) in mice retina. For this experiment, the light-evoked spikes from thousands of RGCs have been correctly recorded and analyzed in real-time. Around 90% of the total clusters have been classified as ON- or OFF-type cells. In addition to the closed-loop system, a denoising prototype has been developed. The main idea is to exploit oversampling techniques to reduce the thermal noise recorded by HDMEAbased acquisition systems. The prototype is capable of processing in real-time all the input signals from the BioCam X, and it is currently being tested to evaluate the performance in terms of signal-to-noise-ratio improvement

    Design of approximate overclocked datapath

    Get PDF
    Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design. In this thesis, we describe an alternative circuit design methodology when considering trade-offs between accuracy, performance and silicon area. We compare two different approaches that could trade accuracy for performance. One is the traditional approach where the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantisation errors. Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit serial operation, and synthesize unrolled digit parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, these result in errors in least significant digits with online arithmetic, causing less impact than conventional implementations.Open Acces

    FPGA structures for high speed and low overhead dynamic circuit specialization

    Get PDF
    A Field Programmable Gate Array (FPGA) is a programmable digital electronic chip. The FPGA does not come with a predefined function from the manufacturer; instead, the developer has to define its function through implementing a digital circuit on the FPGA resources. The functionality of the FPGA can be reprogrammed as desired and hence the name “field programmable”. FPGAs are useful in small volume digital electronic products as the design of a digital custom chip is expensive. Changing the FPGA (also called configuring it) is done by changing the configuration data (in the form of bitstreams) that defines the FPGA functionality. These bitstreams are stored in a memory of the FPGA called configuration memory. The SRAM cells of LookUp Tables (LUTs), Block Random Access Memories (BRAMs) and DSP blocks together form the configuration memory of an FPGA. The configuration data can be modified according to the user’s needs to implement the user-defined hardware. The simplest way to program the configuration memory is to download the bitstreams using a JTAG interface. However, modern techniques such as Partial Reconfiguration (PR) enable us to configure a part in the configuration memory with partial bitstreams during run-time. The reconfiguration is achieved by swapping in partial bitstreams into the configuration memory via a configuration interface called Internal Configuration Access Port (ICAP). The ICAP is a hardware primitive (macro) present in the FPGA used to access the configuration memory internally by an embedded processor. The reconfiguration technique adds flexibility to use specialized ci rcuits that are more compact and more efficient t han t heir b ulky c ounterparts. An example of such an implementation is the use of specialized multipliers instead of big generic multipliers in an FIR implementation with constant coefficients. To specialize these circuits and reconfigure during the run-time, researchers at the HES group proposed the novel technique called parameterized reconfiguration that can be used to efficiently and automatically implement Dynamic Circuit Specialization (DCS) that is built on top of the Partial Reconfiguration method. It uses the run-time reconfiguration technique that is tailored to implement a parameterized design. An application is said to be parameterized if some of its input values change much less frequently than the rest. These inputs are called parameters. Instead of implementing these parameters as regular inputs, in DCS these inputs are implemented as constants, and the application is optimized for the constants. For every change in parameter values, the design is re-optimized (specialized) during run-time and implemented by reconfiguring the optimized design for a new set of parameters. In DCS, the bitstreams of the parameterized design are expressed as Boolean functions of the parameters. For every infrequent change in parameters, a specialized FPGA configuration is generated by evaluating the corresponding Boolean functions, and the FPGA is reconfigured with the specialized configuration. A detailed study of overheads of DCS and providing suitable solutions with appropriate custom FPGA structures is the primary goal of the dissertation. I also suggest different improvements to the FPGA configuration memory architecture. After offering the custom FPGA structures, I investigated the role of DCS on FPGA overlays and the use of custom FPGA structures that help to reduce the overheads of DCS on FPGA overlays. By doing so, I hope I can convince the developer to use DCS (which now comes with minimal costs) in real-world applications. I start the investigations of overheads of DCS by implementing an adaptive FIR filter (using the DCS technique) on three different Xilinx FPGA platforms: Virtex-II Pro, Virtex-5, and Zynq-SoC. The study of how DCS behaves and what is its overhead in the evolution of the three FPGA platforms is the non-trivial basis to discover the costs of DCS. After that, I propose custom FPGA structures (reconfiguration controllers and reconfiguration drivers) to reduce the main overhead (reconfiguration time) of DCS. These structures not only reduce the reconfiguration time but also help curbing the power hungry part of the DCS system. After these chapters, I study the role of DCS on FPGA overlays. I investigate the effect of the proposed FPGA structures on Virtual-Coarse-Grained Reconfigurable Arrays (VCGRAs). I classify the VCGRA implementations into three types: the conventional VCGRA, partially parameterized VCGRA and fully parameterized VCGRA depending upon the level of parameterization. I have designed two variants of VCGRA grids for HPC image processing applications, namely, the MAC grid and Pixie. Finally, I try to tackle the reconfiguration time overhead at the hardware level of the FPGA by customizing the FPGA configuration memory architecture. In this part of my research, I propose to use a parallel memory structure to improve the reconfiguration time of DCS drastically. However, this improvement comes with a significant overhead of hardware resources which will need to be solved in future research on commercial FPGA configuration memory architectures

    Python based FPGA design-flow

    Get PDF
    This dissertation undertakes to establish the feasibility of using MyHDL as a basis on which to develop an FPGA-based DSP tool-ow to target CASPER hardware. MyHDL is an open-source package which enables Python to be used as a hardware definition and verification language. As Python is a high-level language, hardware designers can use it to model and simulate designs, without needing detailed knowledge of the underlying hardware. MyHDL has the ability to convert designs to Verilog or VHDL allowing it to integrate into the more traditional design-ow. The CASPER tool- ow exhibits limitations such as design environment instability and high licensing fees. These shortcomings are addressed by MyHDL. To enable CASPER to take advantage of its powerful features, MyHDL is incorporated into a next generation tool-ow which enables high-level designs to be fully simulated and implemented on the CASPER hardware architectures

    High-level synthesis for reduction of WCET in real-time systems

    Get PDF
    corecore