The UTMOST: A hybrid digital signal processor transforms the MOST
The Molonglo Observatory Synthesis Telescope (MOST) is an 18,000 square meter
radio telescope situated some 40 km from the city of Canberra, Australia. Its
operating band (820-850 MHz) is now partly allocated to mobile phone
communications, making radio astronomy challenging. We describe how the
deployment of new digital receivers (RX boxes), Field Programmable Gate Array
(FPGA) based filterbanks and server-class computers equipped with 43 GPUs
(Graphics Processing Units) has transformed MOST into a versatile new
instrument (the UTMOST) for studying the dynamic radio sky on millisecond
timescales, ideal for work on pulsars and Fast Radio Bursts (FRBs). The
filterbanks, servers and their high-speed, low-latency network form part of a
hybrid solution to the observatory's signal processing requirements. The
emphasis on software and commodity off-the-shelf hardware has enabled rapid
deployment through the re-use of proven 'software backends' for its signal
processing. The new receivers have ten times the bandwidth of the original MOST
and double the sampling of the line feed, which doubles the field of view. The
UTMOST can simultaneously excise interference, make maps, coherently dedisperse
pulsars, and perform real-time searches of coherent fan beams for dispersed
single pulses. Although system performance is still sub-optimal, a pulsar
timing and FRB search programme has commenced and the first UTMOST maps have
been made. The telescope operates as a robotic facility, deciding how to
efficiently target pulsars and how long to stay on source, via feedback from
real-time pulsar folding. The regular timing of over 300 pulsars has resulted
in the discovery of 7 pulsar glitches and 3 FRBs. The UTMOST demonstrates that
if sufficient signal processing can be applied to the voltage streams it is
possible to perform innovative radio science in hostile radio frequency
environments. Comment: 12 pages, 6 figures
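As a rough illustration of why dispersed pulses need dedicated processing in the 820-850 MHz band, the sketch below computes the cold-plasma dispersion delay across the band and applies a simple incoherent (shift-and-sum) dedispersion. It is not the UTMOST pipeline: the coherent dedispersion mentioned in the abstract operates on voltage streams, and the function names, the example dispersion measure and the sampling interval here are assumptions made for illustration.

```python
# Standard dispersion constant: delay in ms for DM in pc cm^-3 and frequency in GHz.
K_DM_MS = 4.148808


def dispersion_delay_ms(dm, f_lo_mhz, f_hi_mhz):
    """Arrival delay (ms) of f_lo relative to f_hi for dispersion measure dm."""
    f_lo = f_lo_mhz / 1000.0
    f_hi = f_hi_mhz / 1000.0
    return K_DM_MS * dm * (f_lo ** -2 - f_hi ** -2)


def dedisperse(channels, freqs_mhz, dm, tsamp_ms):
    """Incoherent dedispersion: shift each channel back by its delay, then sum.

    channels  -- list of equal-length time series, one per frequency channel
    freqs_mhz -- centre frequency of each channel in MHz
    tsamp_ms  -- sampling interval in ms (hypothetical value chosen by the caller)
    """
    f_ref = max(freqs_mhz)  # the highest frequency arrives first
    n = len(channels[0])
    out = [0.0] * n
    for series, f in zip(channels, freqs_mhz):
        shift = round(dispersion_delay_ms(dm, f, f_ref) / tsamp_ms)
        for t in range(n - shift):
            out[t] += series[t + shift]
    return out


if __name__ == "__main__":
    # DM = 500 pc cm^-3 is an assumed example value, not taken from the abstract.
    print(dispersion_delay_ms(500.0, 820.0, 850.0))
```

For the assumed dispersion measure of 500 pc cm^-3, the delay across the band is roughly 0.2 s, which is far longer than a millisecond-duration pulse. Without dedispersion the pulse would be smeared out when the band is summed.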
Exploiting All-Programmable System on Chips for Closed-Loop Real-Time Neural Interfaces
High-density microelectrode arrays (HDMEAs) feature thousands of recording electrodes
in a single chip with an area of a few square millimeters. The obtained electrode density is
comparable to, and even higher than, the typical density of neuronal cells in cortical cultures.
Commercially available HDMEA-based acquisition systems are able to record the neural
activity from the whole array at the same time with submillisecond resolution. These devices
are a very promising tool and are increasingly used in neuroscience to tackle fundamental
questions regarding the complex dynamics of neural networks. Although electrical or optical
stimulation is generally available in such systems, they lack the capability of
creating a closed loop between the biological neural activity and the artificial system. Stimuli
are usually sent in an open-loop manner, thus violating the inherent working basis of neural
circuits, which in nature constantly react to the external environment. This prevents
the real mechanisms behind the behavior of neural networks from being unraveled.
The primary objective of this PhD work is to overcome this limitation by creating a fully reconfigurable
processing system capable of providing real-time feedback to the ongoing
neural activity recorded with HDMEA platforms. The potential of modern heterogeneous
FPGAs has been exploited to realize the system. In particular, the Xilinx Zynq All Programmable
System on Chip (APSoC) has been used. The device features reconfigurable
logic, specialized hardwired blocks, and a dual-core ARM-based processor; the synergy of
these components makes it possible to achieve high processing performance while maintaining a high
level of flexibility and adaptivity. The developed system has been embedded in an acquisition
and stimulation setup featuring the following platforms:
• 3·Brain BioCam X, a state-of-the-art HDMEA-based acquisition platform capable of
recording in parallel from 4096 electrodes at 18 kHz per electrode.
• PlexStim™ Electrical Stimulator System, able to generate electrical stimuli with
custom waveforms to 16 different output channels.
• Texas Instruments DLP® LightCrafter™ Evaluation Module, capable of projecting
608x684 pixels images with a refresh rate of 60 Hz; it holds the function of optical
stimulation.
All the features of the system, such as band-pass filtering and spike detection of all the
recorded channels, have been validated by means of ex vivo experiments. Very low latency
has been achieved while processing the whole input data stream in real time. In the case
of electrical stimulation the total latency is below 2 ms; when optical stimuli are needed,
instead, the total latency is a little higher, being 21 ms in the worst case.
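To give a feel for the processing load and for what per-channel spike detection involves, the sketch below computes the aggregate sample rate implied by the figures above and runs a simple threshold detector on one channel. The detector (a threshold at k times a median-based noise estimate, with a refractory window) is a common textbook approach assumed here for illustration. It is not claimed to be the algorithm implemented on the Zynq device, and the names `detect_spikes`, `k` and `refractory` are hypothetical.

```python
from statistics import median

N_ELECTRODES = 4096       # from the text
SAMPLE_RATE_HZ = 18_000   # from the text, per electrode


def aggregate_rate():
    """Total samples per second the system must process in real time."""
    return N_ELECTRODES * SAMPLE_RATE_HZ


def detect_spikes(samples, k=5.0, refractory=18):
    """Return indices where the signal deviates from its median by more than k*sigma.

    sigma is estimated robustly as MAD / 0.6745 (an assumed, commonly used rule).
    refractory is the minimum spacing in samples between detections
    (18 samples is about 1 ms at 18 kHz).
    """
    med = median(samples)
    sigma = median([abs(x - med) for x in samples]) / 0.6745
    threshold = k * sigma
    spikes = []
    last = -refractory
    for i, x in enumerate(samples):
        if abs(x - med) > threshold and i - last >= refractory:
            spikes.append(i)
            last = i
    return spikes


if __name__ == "__main__":
    print(aggregate_rate())  # 4096 * 18000 samples per second
```

At 18 kHz, the 2 ms electrical-stimulation latency quoted above corresponds to 36 sample periods per channel, which is the budget any detector of this kind would have to fit within.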
The final setup is ready to be used to infer cellular properties by means of closed-loop
experiments. As a proof of concept, it has been successfully used for the clustering
and classification of retinal ganglion cells (RGCs) in the mouse retina. For this experiment, the
light-evoked spikes from thousands of RGCs have been correctly recorded and analyzed in
real-time. Around 90% of the total clusters have been classified as ON- or OFF-type cells.
In addition to the closed-loop system, a denoising prototype has been developed. The main
idea is to exploit oversampling techniques to reduce the thermal noise recorded by HDMEA-based
acquisition systems. The prototype is capable of processing in real time all the input
signals from the BioCam X, and it is currently being tested to evaluate the performance in
terms of signal-to-noise ratio improvement.
FPGA structures for high speed and low overhead dynamic circuit specialization
A Field Programmable Gate Array (FPGA) is a programmable digital electronic chip. The FPGA does not come with a predefined function from the manufacturer; instead, the developer has to define its function through implementing a digital circuit on the FPGA resources. The functionality of the FPGA can be reprogrammed as desired and hence the name “field programmable”. FPGAs are useful in small volume digital electronic products as the design of a digital custom chip is expensive. Changing the FPGA (also called configuring it) is done by changing the configuration data (in the form of bitstreams) that defines the FPGA functionality. These bitstreams are stored in a memory of the FPGA called configuration memory. The SRAM cells of LookUp Tables (LUTs), Block Random Access Memories (BRAMs) and DSP blocks together form the configuration memory of an FPGA. The configuration data can be modified according to the user’s needs to implement the user-defined hardware. The simplest way to program the configuration memory is to download the bitstreams using a JTAG interface. However, modern techniques such as Partial Reconfiguration (PR) enable us to configure a part in the configuration memory with partial bitstreams during run-time. The reconfiguration
is achieved by swapping in partial bitstreams into the configuration memory via a configuration interface called Internal Configuration Access Port (ICAP). The ICAP is a hardware primitive (macro) present in the FPGA used to access the
configuration memory internally by an embedded processor. The reconfiguration technique adds the flexibility to use specialized circuits that are more compact and more efficient than their bulky generic counterparts. An example of such an implementation is the use of specialized multipliers instead of large generic multipliers in an FIR implementation with constant coefficients. To specialize these circuits and reconfigure them at run-time, researchers at the HES group proposed a novel technique called parameterized reconfiguration, which can be used to efficiently and automatically implement Dynamic Circuit Specialization (DCS) and is built on top of the Partial Reconfiguration method. It uses
the run-time reconfiguration technique that is tailored to implement a parameterized design. An application is said to be parameterized if some of its input values change much less frequently than the rest. These inputs are called parameters. Instead of implementing these parameters as regular inputs, in DCS these inputs are implemented as constants, and the application is optimized for the constants. For every change in parameter values, the design is re-optimized (specialized) during run-time and implemented by reconfiguring the optimized design for a new set of parameters. In DCS, the bitstreams of the parameterized design are expressed as Boolean functions of the parameters. For every infrequent change in parameters, a specialized FPGA configuration is generated by evaluating the corresponding Boolean functions, and the FPGA is reconfigured with the specialized configuration. The primary goal of the dissertation is a detailed study of the overheads of DCS and the provision of suitable solutions through appropriate custom FPGA structures. I also suggest different improvements to the FPGA configuration memory architecture. After proposing the custom FPGA structures, I investigate the role of DCS on FPGA overlays and the use of custom FPGA structures that help to reduce the overheads of DCS on FPGA overlays. By doing so, I hope to convince developers to use DCS (which now comes with minimal costs) in real-world applications. I start the investigation of the overheads of DCS by implementing an adaptive FIR filter (using the DCS technique) on three different Xilinx FPGA platforms: Virtex-II Pro, Virtex-5, and Zynq-SoC. Studying how DCS behaves and how its overhead evolves across the three FPGA platforms forms the basis for identifying the costs of DCS. After that, I propose custom FPGA structures (reconfiguration controllers and reconfiguration drivers) to reduce the main overhead (reconfiguration time) of DCS.
These structures not only reduce the reconfiguration time but also help curb the power-hungry part of the DCS system. After these chapters, I study the role of DCS on FPGA overlays. I investigate the effect of the proposed FPGA structures on Virtual Coarse-Grained Reconfigurable Arrays (VCGRAs). I classify the VCGRA implementations into three types, depending on the level of parameterization: the conventional VCGRA, the partially parameterized VCGRA and the fully parameterized VCGRA. I have designed two variants of VCGRA grids for HPC image processing applications,
namely, the MAC grid and Pixie. Finally, I try to tackle the reconfiguration time overhead at the hardware level of the FPGA by customizing the FPGA configuration memory architecture. In this part of my research, I propose to use a parallel memory structure to improve the reconfiguration time of DCS drastically. However, this improvement comes with a
significant overhead in hardware resources, which will need to be addressed in future research on commercial FPGA configuration memory architectures.
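The idea that configuration bits are Boolean functions of infrequently changing parameters can be illustrated with a toy software model of a constant-coefficient multiplier. In the sketch below, each output bit of the product is produced by one look-up table whose truth table depends only on the coefficient. Changing the coefficient means re-evaluating the truth tables ("reconfiguring") rather than routing the coefficient in as a regular input. The class and function names, the 4-bit operand width and the 8-bit output width are assumptions for illustration only. This is not the HES group's tool flow, and it does not model the ICAP or real bitstreams.

```python
def specialize_lut(coeff, out_bit, n_inputs=4):
    """Evaluate the truth table of one LUT for a given parameter value.

    Entry x of the table is bit `out_bit` of x * coeff, so every table bit is
    a Boolean function of the parameter `coeff`.
    """
    return [((x * coeff) >> out_bit) & 1 for x in range(1 << n_inputs)]


class TunableMultiplier:
    """Toy model of a multiplier specialized for a constant coefficient."""

    def __init__(self, coeff, n_inputs=4, out_bits=8):
        self.n_inputs = n_inputs
        self.out_bits = out_bits
        self.luts = []
        self.reconfigure(coeff)

    def reconfigure(self, coeff):
        """Re-evaluate all truth tables; stands in for run-time reconfiguration."""
        self.luts = [
            specialize_lut(coeff, b, self.n_inputs) for b in range(self.out_bits)
        ]

    def multiply(self, x):
        """Data path: only table look-ups, the coefficient is not an input."""
        return sum(self.luts[b][x] << b for b in range(self.out_bits))


if __name__ == "__main__":
    m = TunableMultiplier(coeff=5)
    print(m.multiply(7))
    m.reconfigure(11)   # infrequent parameter change
    print(m.multiply(13))
```

The cost trade-off discussed in the abstract is visible even in this model: `multiply` is cheap, while every call to `reconfigure` rewrites all table contents, which corresponds to the reconfiguration-time overhead that the custom FPGA structures aim to reduce.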
How to efficiently reconfigure tunable lookup tables for dynamic circuit specialization
Dynamic Circuit Specialization (DCS) is used to optimize the implementation of a parameterized application on an FPGA. Instead of implementing the parameters as regular inputs, in the DCS approach these inputs are implemented as constants. When the parameter values change, the design is re-optimized for the new constant values by reconfiguring the FPGA. This allows a faster and more resource-efficient implementation, but investigations have shown that reconfiguration time is the major limitation for DCS implementations on Xilinx FPGAs. The limitation arises from the use of inefficient reconfiguration methods in conventional DCS implementations. To address this issue, we propose different approaches that drastically reduce the reconfiguration time. In this context, this paper presents the use of custom reconfiguration controllers and custom reconfiguration software drivers, along with placement constraints, to shorten the reconfiguration time. Our results show an improvement in reconfiguration speed of at least a factor of 14 when using the Xilinx reconfiguration controller along with placement constraints. The improvement can reach a factor of 40 with the combination of a custom reconfiguration controller, custom software drivers, and placement constraints. We also observe a degradation of at least 6% in the system's performance due to the placement constraints.
Hardware implementation of daubechies wavelet transforms using folded AIQ mapping
The Discrete Wavelet Transform (DWT) is a popular tool in the field of image and video compression applications. Because of its multi-resolution representation capability, the DWT has been used effectively in applications such as transient signal analysis, computer vision, texture analysis, cell detection, and image compression. Daubechies wavelets are one of the popular transforms in the wavelet family. Daubechies filters provide excellent spatial and spectral locality properties, which make them useful in image compression.
In this thesis, we present an efficient implementation of a shared hardware core to compute two 8-point Daubechies wavelet transforms. The architecture is based on a new two-level folded mapping technique, an improved version of the Algebraic Integer Quantization (AIQ). The scheme is built on a factorization and decomposition of the transform coefficients that exploits the symmetrical and wrapping structure of the matrices. The proposed architecture is parallel, pipelined, and multiplexed. Compared to existing designs, the proposed scheme significantly reduces hardware cost, critical path delay and power consumption while achieving a higher throughput rate.
We then briefly present a new mapping scheme for error-free computation of the Daubechies-8 tap wavelet transform, which is the next transform after Daubechies-6 in the Daubechies wavelet series. The multidimensional technique maps the irrational transformation basis coefficients to integers and results in a considerable reduction in hardware and power consumption, and a significant improvement in image reconstruction quality.
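The principle behind algebraic integer encoding can be sketched with the Daubechies-4 low-pass filter, whose taps are (1+√3, 3+√3, 3−√3, 1−√3) divided by 4√2. Each scaled tap is written as an integer pair (a, b) standing for a + b√3, so the filter output for integer inputs is computed with integer arithmetic only, and the irrational factors are applied once in a final reconstruction step. Daubechies-4 is chosen here purely because it is the smallest example. The sketch shows plain one-dimensional AIQ and not the two-level folded mapping or the Daubechies-6 and Daubechies-8 schemes of the thesis, and the function names are hypothetical.

```python
import math

# Daubechies-4 low-pass taps scaled by 4*sqrt(2), each stored as (a, b) = a + b*sqrt(3).
D4_LOWPASS_AIQ = [(1, 1), (3, 1), (3, -1), (1, -1)]


def lowpass_aiq(x):
    """Exact integer-domain filter output for four integer samples.

    Returns (A, B) such that the true output is (A + B*sqrt(3)) / (4*sqrt(2)).
    No rounding occurs here because only integer multiply-adds are used.
    """
    a = sum(xi * c[0] for xi, c in zip(x, D4_LOWPASS_AIQ))
    b = sum(xi * c[1] for xi, c in zip(x, D4_LOWPASS_AIQ))
    return a, b


def reconstruct(a, b):
    """Final reconstruction step: the only place irrational values appear."""
    return (a + b * math.sqrt(3)) / (4 * math.sqrt(2))


if __name__ == "__main__":
    a, b = lowpass_aiq([1, 2, 3, 4])
    print(a, b, reconstruct(a, b))
```

Because quantization of √3 and √2 is deferred to `reconstruct`, errors do not accumulate through intermediate stages, which is the property that "error-free" refers to in the abstract.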
REAL-TIME ADAPTIVE PULSE COMPRESSION ON RECONFIGURABLE, SYSTEM-ON-CHIP (SOC) PLATFORMS
New radar applications need to run complex algorithms and process large quantities of data to generate useful information for users. This situation has motivated the search for better processing solutions that include low-power, high-performance processors, efficient algorithms, and high-speed interfaces. In this work, a hardware implementation of adaptive pulse compression algorithms for real-time transceiver optimization is presented, based on a System-on-Chip architecture for reconfigurable hardware devices. This study also evaluates the performance of dedicated coprocessors as hardware accelerator units to speed up compute-intensive tasks such as matrix multiplication and matrix inversion, which are essential for solving the covariance matrix. The tradeoffs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through high-performance buses, to perform floating-point operations, control the processing blocks, and communicate with an external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operation using a Ku-band testbed together with a low-cost channel emulator for different types of waveforms.
- …