1,020 research outputs found
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique for both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.Comment: To appear at the DSN 2020 conferenc
A Scalable Correlator Architecture Based on Modular FPGA Hardware, Reuseable Gateware, and Data Packetization
A new generation of radio telescopes is achieving unprecedented levels of
sensitivity and resolution, as well as increased agility and field-of-view, by
employing high-performance digital signal processing hardware to phase and
correlate large numbers of antennas. The computational demands of these imaging
systems scale in proportion to BMN^2, where B is the signal bandwidth, M is the
number of independent beams, and N is the number of antennas. The
specifications of many new arrays lead to demands in excess of tens of PetaOps
per second.
To meet this challenge, we have developed a general purpose correlator
architecture using standard 10-Gbit Ethernet switches to pass data between
flexible hardware modules containing Field Programmable Gate Array (FPGA)
chips. These chips are programmed using open-source signal processing libraries
we have developed to be flexible, scalable, and chip-independent. This work
reduces the time and cost of implementing a wide range of signal processing
systems, with correlators foremost among them,and facilitates upgrading to new
generations of processing technology. We present several correlator
deployments, including a 16-antenna, 200-MHz bandwidth, 4-bit, full Stokes
parameter application deployed on the Precision Array for Probing the Epoch of
Reionization.Comment: Accepted to Publications of the Astronomy Society of the Pacific. 31
pages. v2: corrected typo, v3: corrected Fig. 1
A Survey of FPGA Optimization Methods for Data Center Energy Efficiency
This article provides a survey of academic literature about field
programmable gate array (FPGA) and their utilization for energy efficiency
acceleration in data centers. The goal is to critically present the existing
FPGA energy optimization techniques and discuss how they can be applied to such
systems. To do so, the article explores current energy trends and their
projection to the future with particular attention to the requirements set out
by the European Code of Conduct for Data Center Energy Efficiency. The article
then proposes a complete analysis of over ten years of research in energy
optimization techniques, classifying them by purpose, method of application,
and impacts on the sources of consumption. Finally, we conclude with the
challenges and possible innovations we expect for this sector.Comment: Accepted for publication in IEEE Transactions on Sustainable
Computin
Online Timing Slack Measurement and its Application in Field-Programmable Gate Arrays
Reliability, power consumption and timing performance are key concerns for today's integrated circuits. Measurement techniques capable of quantifying the timing characteristics of a circuit, while it is operating, facilitate a range of benefits. Delay variation due to environmental and operational conditions, and degradation can be monitored by tracking changes in timing performance. Using the measurements in a closed-loop to control power supply voltage or clock frequency allows for the reduction of timing safety margins, leading to improvements in power consumption or throughput performance through the exploitation of better-than worst-case operation.
This thesis describes a novel online timing slack measurement method which can directly measure the timing performance of a circuit, accurately and with minimal overhead. Enhancements allow for the improvement of absolute accuracy and resolution. A compilation flow is reported that can automatically instrument arbitrary circuits on FPGAs with the measurement circuitry. On its own this measurement method is able to track the "health" of an integrated circuit, from commissioning through its lifetime, warning of impending failure or instigating pre-emptive degradation mitigation techniques.
The use of the measurement method in a closed-loop dynamic voltage and frequency scaling scheme has been demonstrated, achieving significant improvements in power consumption and throughput performance.Open Acces
An experimental study of reduced-voltage operation in modern FPGAs for neural network acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect ofenvironmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W ) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.The work done for this paper was partially supported by a HiPEAC Collaboration Grant funded by the H2020 HiPEAC Project under grant agreement No. 779656. The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under the LEGaTO Project (www.legato-project.eu), grant agreement No. 780681.Peer ReviewedPostprint (author's final draft
Cryogenic Control Beyond 100 Qubits
Quantum computation has been a major focus of research in the past two decades, with recent experiments demonstrating basic algorithms on small numbers of qubits. A large-scale universal quantum computer would have a profound impact on science and technology, providing a solution to several problems intractable for classical computers. To realise such a machine, today's small experiments must be scaled up, and a system must be built which provides control and measurement of many hundreds of qubits. A device of this scale is challenging: qubits are highly sensitive to their environment, and sophisticated isolation techniques are required to preserve the qubits' fragile states. Solid-state qubits require deep-cryogenic cooling to suppress thermal excitations. Yet current state-of-the-art experiments use room-temperature electronics which are electrically connected to the qubits. This thesis investigates various scalable technologies and techniques which can be used to control quantum systems. With the requirements for semiconductor spin-qubits in mind, several custom electronic systems, to provide quantum control from deep cryogenic temperatures, are designed and measured. A system architecture is proposed for quantum control, providing a scalable approach to executing quantum algorithms on a large number of qubits. Control of a gallium arsenide qubit is demonstrated using a cryogenically operated FPGA driving custom gallium arsenide switches. The cryogenic performance of a commercial FPGA is measured, as the main logic processor in a cryogenic quantum control system, and digital-to-analog converters are analysed during cryogenic operation. Recent work towards a 100-qubit cryogenic control system is shown, including the design of interconnect solutions and multiplexing circuitry. With qubit fidelity over the fault-tolerant threshold for certain error correcting codes, accompanying control platforms will play a key role in the development of a scalable quantum machine
Data Provenance and Management in Radio Astronomy: A Stream Computing Approach
New approaches for data provenance and data management (DPDM) are required
for mega science projects like the Square Kilometer Array, characterized by
extremely large data volume and intense data rates, therefore demanding
innovative and highly efficient computational paradigms. In this context, we
explore a stream-computing approach with the emphasis on the use of
accelerators. In particular, we make use of a new generation of high
performance stream-based parallelization middleware known as InfoSphere
Streams. Its viability for managing and ensuring interoperability and integrity
of signal processing data pipelines is demonstrated in radio astronomy. IBM
InfoSphere Streams embraces the stream-computing paradigm. It is a shift from
conventional data mining techniques (involving analysis of existing data from
databases) towards real-time analytic processing. We discuss using InfoSphere
Streams for effective DPDM in radio astronomy and propose a way in which
InfoSphere Streams can be utilized for large antennae arrays. We present a
case-study: the InfoSphere Streams implementation of an autocorrelating
spectrometer, and using this example we discuss the advantages of the
stream-computing approach and the utilization of hardware accelerators
Reconfigurable Architectures and Systems for IoT Applications
abstract: Internet of Things (IoT) has become a popular topic in industry over the recent years, which describes an ecosystem of internet-connected devices or things that enrich the everyday life by improving our productivity and efficiency. The primary components of the IoT ecosystem are hardware, software and services. While the software and services of IoT system focus on data collection and processing to make decisions, the underlying hardware is responsible for sensing the information, preprocess and transmit it to the servers. Since the IoT ecosystem is still in infancy, there is a great need for rapid prototyping platforms that would help accelerate the hardware design process. However, depending on the target IoT application, different sensors are required to sense the signals such as heart-rate, temperature, pressure, acceleration, etc., and there is a great need for reconfigurable platforms that can prototype different sensor interfacing circuits.
This thesis primarily focuses on two important hardware aspects of an IoT system: (a) an FPAA based reconfigurable sensing front-end system and (b) an FPGA based reconfigurable processing system. To enable reconfiguration capability for any sensor type, Programmable ANalog Device Array (PANDA), a transistor-level analog reconfigurable platform is proposed. CAD tools required for implementation of front-end circuits on the platform are also developed. To demonstrate the capability of the platform on silicon, a small-scale array of 24×25 PANDA cells is fabricated in 65nm technology. Several analog circuit building blocks including amplifiers, bias circuits and filters are prototyped on the platform, which demonstrates the effectiveness of the platform for rapid prototyping IoT sensor interfaces.
IoT systems typically use machine learning algorithms that run on the servers to process the data in order to make decisions. Recently, embedded processors are being used to preprocess the data at the energy-constrained sensor node or at IoT gateway, which saves considerable energy for transmission and bandwidth. Using conventional CPU based systems for implementing the machine learning algorithms is not energy-efficient. Hence an FPGA based hardware accelerator is proposed and an optimization methodology is developed to maximize throughput of any convolutional neural network (CNN) based machine learning algorithm on a resource-constrained FPGA.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201
Recommended from our members
Efficient FPGA implementation and power modelling of image and signal processing IP cores
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Field Programmable Gate Arrays (FPGAs) are the technology of choice in a number ofimage
and signal processing application areas such as consumer electronics, instrumentation,
medical data processing and avionics due to their reasonable energy consumption, high performance, security, low design-turnaround time and reconfigurability. Low power FPGA
devices are also emerging as competitive solutions for mobile and thermally constrained platforms. Most computationally intensive image and signal processing algorithms also consume a lot of power leading to a number of issues including reduced mobility, reliability concerns and increased design cost among others. Power dissipation has become one of the most important challenges, particularly for FPGAs. Addressing this problem requires optimisation and awareness at all levels in the design flow. The key achievements of the
work presented in this thesis are summarised here. Behavioural level optimisation strategies have been used for implementing matrix product and inner product through the use of mathematical techniques such as Distributed Arithmetic (DA) and its variations including offset binary coding, sparse factorisation and novel vector level transformations. Applications to test the impact of these algorithmic and arithmetic transformations include the fast Hadamard/Walsh transforms and Gaussian mixture models. Complete design space exploration has been performed on these cores, and where appropriate, they have been shown to clearly outperform comparable existing implementations. At the architectural level, strategies such as parallelism, pipelining and systolisation have been successfully applied for the design and optimisation of a number of
cores including colour space conversion, finite Radon transform, finite ridgelet transform and circular convolution. A pioneering study into the influence of supply voltage scaling for FPGA based designs, used in conjunction with performance enhancing strategies such as parallelism and pipelining has been performed. Initial results are very promising and indicated significant potential for future research in this area.
A key contribution of this work includes the development of a novel high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called Functional Level Power Analysis and Modelling (FLPAM). FLPAM
is scalable, platform independent and compares favourably with existing approaches. A hybrid, top-down design flow paradigm integrating FLPAM with commercially available design tools for systematic optimisation of IP cores has also been developed
Recommended from our members
High-Speed Wide-Field Time-Correlated Single-Photon Counting Fluorescence Lifetime Imaging Microscopy
Fluorescence microscopy is a powerful imaging technique used in the biological sciences to identify labeled components of a sample with specificity. This is usually accomplished through labeling with fluorescent dyes, isolating these dyes by their spectral signatures with optical filters, and recording the intensity of the fluorescent response. Although these techniques are widely used, fluorescence intensity images can be negatively affected by a variety of factors that impact the fluorescence intensity. Fluorescence lifetime imaging microscopy (FLIM) is an imaging technique that is relatively immune to intensity fluctuations and also provides the unique ability to directly monitor the microenvironment surrounding a fluorophore. Despite the benefits associated with FLIM, the applications to which it is applied are fairly limited due to long image acquisition times and high cost of traditional hardware. Recent advances in complementary metal-oxide-semiconductor (CMOS) single-photon avalanche diodes (SPADs) have enabled the design of low-cost imaging arrays that are capable of recording lifetime images with acquisition times greater than one order of magnitude faster than existing systems. However, these SPAD arrays have yet to realize the full potential of the technology due to limitations in their ability to handle the vast amount of data generated during the commonly used time-correlated single-photon counting (TCSPC) lifetime imaging technique. This thesis presents the design, implementation, characterization, and demonstration of a high speed FLIM imaging system. The components of this design include a CMOS imager chip in a standard 0.13 μm technology containing a custom CMOS SPAD, a 64-by-64 array of these SPADs, pixel control circuitry, independent time-to-digital converters (TDCs), a FLIM specific datapath, and high bandwidth output buffers. In addition to the CMOS imaging array, a complete system was designed and implemented using a printed circuit board (PCB) for capturing data from the imager, creating histograms for the photon arrival data using field-programmable gate arrays, and transferring the data to a computer using a cabled PCIe interface. Finally, software is used to communicate between the imaging system and a computer.The dark count rate of the SPAD was measured to be only 231 Hz at room temperature while maintaining a photon detection probability of up to 30\%. TDCs included on the array have a 62.5 ps resolution and a 64 ns range, which is suitable for measuring the lifetime of most biological fluorophores. Additionally, the on-chip datapath was designed to handle continuous data transfers at rates capable of supporting TCSPC-based lifetime imaging at 100 frames per second. The system level implementation also provides sufficient data throughput for transferring up to 750 frames per second from the imaging system to a computer. The lifetime imaging system was characterized using standard techniques for evaluating SPAD performance and an electrical delay signal for measuring the TDC performance. This thesis concludes with a demonstration of TCSPC-FLIM imaging at 100 frames per second -- the fastest 64-by-64 TCSPC FLIM that has been demonstrated. This system overcomes some of the limitations of existing FLIM systems and has the potential to enable new application domains in dynamic FLIM imaging
- …