166 research outputs found
Impact of the hardened floating-point cores on HIL technology
The Hardware-In-the-Loop (HIL) technique is increasingly used for testing power electronics. FPGAs (Field-Programmable Gate Array) are becoming usual in this kind of emulation due to their acceleration capabilities. But even using FPGAs, it has not been possible to reach real time simulations when small integration steps are necessary (around 100 ns or lower) if floating-point representation is used. Fixed-point has been the solution, but at a high design effort cost. With the release of FPGAs with HFP (Hardened Floating-Point) cores â dedicated floating-point blocks implemented in silicon â the minimum achievable simulation step decreases significantly. This paper presents a comparison between HFP cores, floating-point in programmable logic and fixed-point for HIL models. Results show that both HFP-based and fixed-point arithmetic achieve a simulation step around 10 ns for a full-bridge converter model. A comparison regarding resolution and accuracy is also presented, because acceleration is not the only issue when decreasing the integration step. Numerical resolution also plays an important role, and 32-bit floating-point representation finds a double barrier: acceleration marked by technology, and numerical resolution. Both are explored in this paperThis work has been supported by the Spanish Ministerio deEconomĂa y Competitividad under project TEC2013-43017-
FPGA Implementation of Double Precision Floating Point Multiplier
High speed computation is the need of todayâs generation of Processors. To accomplish this major task, many functions are implemented inside the hardware of the processor rather than having software computing the same task. Majority of the operations which the processor executes are Arithmetic operations which are widely used in many applications that require heavy mathematical operations such as scientific calculations, image and signal processing. Especially in the field of signal processing, multiplication division operation is widely used in many applications. The major issue with these operations in hardware is that much iteration is required which results in slow operation while fast algorithms require complex computations within each cycle. The result of a Division operation results in a either in Quotient and Remainder or a Floating point number which is the major reason to make it more complex than Multiplication operation
LOCOFloat: A low-cost floating-point format for FPGAs.: Application to HIL simulators
One of the main decisions when making a digital design is which arithmetic is going
to be used. The arithmetic determines the hardware resources needed and the latency of every
operation. This is especially important in real-time applications like HIL (Hardware-in-the-loop),
where a real-time simulation of a plantâpower converter, mechanical system, or any other complex
systemâis accomplished. While a fixed-point gets optimal implementations, using considerably
fewer resources and allowing smaller simulation steps, its use is very restricted to very specific
applications, as its design effort is quite high. On the other side, IEEE-754 floating-point may
have resolution problems in case of the 32-bit version, and excessive hardware usage in case of the
64-bit version. This paper presents LOCOFloat, a low-cost floating-point format designed for FPGA
applications. Its key features are soft normalization of the results, using significand and exponent
fields in twoâs complement. This paper shows the implementation of addition, subtraction and
multiplication of the proposed format. Both IEEE-754 versions and LOCOFloat are compared in
this paper, implementing a HIL model of a buck converter. Although the application example is a
HIL simulator, other applications could take benefit from the proposed format. Results show that
LOCOFloat is as accurate as 64-bit floating-point, while reducing the use of DSPs blocks by 84%
Optimising algorithm and hardware for deep neural networks on FPGAs
This thesis proposes novel algorithm and hardware optimisation approaches to accelerate Deep Neural Networks (DNNs), including both Convolutional Neural Networks (CNNs) and Bayesian Neural Networks (BayesNNs).
The first contribution of this thesis is to propose an adaptable and reconfigurable hardware design to accelerate CNNs. By analysing the computational patterns of different CNNs, a unified hardware architecture is proposed for both 2-Dimension and 3-Dimension CNNs. The accelerator is also designed with runtime adaptability, which adopts different parallelism strategies for different convolutional layers at runtime.
The second contribution of this thesis is to propose a novel neural network architecture and hardware design co-optimisation approach, which improves the performance of CNNs at both algorithm and hardware levels. Our proposed three-phase co-design framework decouples network training from design space exploration, which significantly reduces the time-cost of the co-optimisation process.
The third contribution of this thesis is to propose an algorithmic and hardware co-optimisation framework for accelerating BayesNNs. At the algorithmic level, three categories of structured sparsity are explored to reduce the computational complexity of BayesNNs. At the hardware level, we propose a novel hardware architecture with the aim of exploiting the structured sparsity for BayesNNs. Both algorithmic and hardware optimisations are jointly applied to push the performance limit.Open Acces
FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods
Background Likelihood (ML)-based phylogenetic inference has become a popular method for estimating the evolutionary relationships among species based on genomic sequence data. This method is used in applications such as RAxML, GARLI, MrBayes, PAML, and PAUP. The Phylogenetic Likelihood Function (PLF) is an important kernel computation for this method. The PLF consists of a loop with no conditional behavior or dependencies between iterations. As such it contains a high potential for exploiting parallelism using micro-architectural techniques. In this paper, we describe a technique for mapping the PLF and supporting logic onto a Field Programmable Gate Array (FPGA)-based co-processor. By leveraging the FPGA\u27s on-chip DSP modules and the high-bandwidth local memory attached to the FPGA, the resultant co-processor can accelerate ML-based methods and outperform state-of-the-art multi-core processors.
Results We use the MrBayes 3 tool as a framework for designing our co-processor. For large datasets, we estimate that our accelerated MrBayes, if run on a current-generation FPGA, achieves a 10Ă speedup relative to software running on a state-of-the-art server-class microprocessor. The FPGA-based implementation achieves its performance by deeply pipelining the likelihood computations, performing multiple floating-point operations in parallel, and through a natural log approximation that is chosen specifically to leverage a deeply pipelined custom architecture.
Conclusions Heterogeneous computing, which combines general-purpose processors with special-purpose co-processors such as FPGAs and GPUs, is a promising approach for high-performance phylogeny inference as shown by the growing body of literature in this field. FPGAs in particular are well-suited for this task because of their low power consumption as compared to many-core processors and Graphics Processor Units (GPUs)
Real -time Retinex image enhancement: Algorithm and architecture optimizations
The field of digital image processing encompasses the study of algorithms applied to two-dimensional digital images, such as photographs, or three-dimensional signals, such as digital video. Digital image processing algorithms are generally divided into several distinct branches including image analysis, synthesis, segmentation, compression, restoration, and enhancement. One particular image enhancement algorithm that is rapidly gaining widespread acceptance as a near optimal solution for providing good visual representations of scenes is the Retinex.;The Retinex algorithm performs a non-linear transform that improves the brightness, contrast and sharpness of an image. It simultaneously provides dynamic range compression, color constancy, and color rendition. It has been successfully applied to still imagery---captured from a wide variety of sources including medical radiometry, forensic investigations, and consumer photography. Many potential users require a real-time implementation of the algorithm. However, prior to this research effort, no real-time version of the algorithm had ever been achieved.;In this dissertation, we research and provide solutions to the issues associated with performing real-time Retinex image enhancement. We design, develop, test, and evaluate the algorithm and architecture optimizations that we developed to enable the implementation of the real-time Retinex specifically targeting specialized, embedded digital signal processors (DSPs). This includes optimization and mapping of the algorithm to different DSPs, and configuration of these architectures to support real-time processing.;First, we developed and implemented the single-scale monochrome Retinex on a Texas Instruments TMS320C6711 floating-point DSP and attained 21 frames per second (fps) performance. This design was then transferred to the faster TMS320C6713 floating-point DSP and ran at 28 fps. Then we modified our design for the fixed-point TMS320DM642 DSP and achieved an execution rate of 70 fps. Finally, we migrated this design to the fixed-point TMS320C6416 DSP. After making several additional optimizations and exploiting the enhanced architecture of the TMS320C6416, we achieved 108 fps and 20 fps performance for the single-scale, monochrome Retinex and three-scale, color Retinex, respectively. We also applied a version of our real-time Retinex in an Enhanced Vision System. This provides a general basis for using the algorithm in other applications
FPGA Based Acceleration of Matrix Decomposition and Clustering Algorithm Using High Level Synthesis
FPGAs have shown great promise for accelerating computationally intensive algorithms. However, FPGA-based accelerator design is tedious and time consuming if we rely on traditional HDL based design method. Recent introduction of Altera SDK for OpenCL (AOCL) high level synthesis tool enables developers to utilize FPGAâs potential without long development time and extensive hardware knowledge. AOCL is used in this thesis to accelerate computationally intensive algorithms in the field of machine learning and scientific computing. The algorithms studied are k-means clustering, k-nearest neighbour search, N-body simulation and LU decomposition. The performance and power consumption of the algorithms synthesized using AOCL for FPGA are evaluated against state of the art CPU and GPU implementations. The k-means clustering and k-nearest neighbor kernels designed for FPGA significantly out-performed optimized CPU implementations while achieving similar or better power efficiency than that of GPU
Conceptual design and realization of a dynamic partial reconfiguration extension of an existing soft-core processor
Viele aktuelle Field Programmable Gate Arrays (FPGAs) unterstĂŒtzen die Technik der partiellen Rekonfiguration (PR), durch die dynamisch zur Laufzeit ein Hardware-Design auch nur teilweise ausgetauscht werden kann. Die vorliegende Arbeit integriert PR-FunktionalitĂ€t in die an der Technischen UniversitĂ€t Ilmenau fĂŒr harte Echtzeitaufgaben mit hochprĂ€zisen FlieĂkommaberechnungen entwickelte VHDL Integrated Softcore Architecture for Reconfigurable Devices (ViSARD). Zu diesem Zweck wird die arithmetisch-logische Einheit angepasst, um das Auswechseln von FlieĂkomma-AusfĂŒhrungseinheiten zu ermöglichen. Ziele der Entwicklung des PR-Systems sind hohe Geschwindigkeit, niedrige Latenz, niedrige Ressourcenkosten und harte EchtzeitfĂ€higkeit. Erreicht werden diese durch die Umsetzung einer eigenen Steuereinheit (partial reconfiguration controller), die partielle Bitströme aus externem RAM ĂŒber einen standardmĂ€Ăigen AXI-Bus lĂ€dt sowie die entsprechende Erweiterung der ViSARD. In einem Testdesign, das zwischen drei verschiedenen Konfigurationen mit je zwischen einer und drei AusfĂŒhrungseinheiten wechselt, hat das entwickelte PR-System den maximal spezifierten Bitstromdurchsatz auf dem Ziel-FPGA erreicht und den Verbrauch an Lookup-Tabellen um etwa 40 % verringert.Many modern field-programmable gate arrays (FPGAs) support partial reconfiguration, which allows to dynamically replace only a part of a design at run time. In this thesis, partial reconfiguration capability is integrated with the VHDL Integrated Softcore Architecture for Reconfigurable Devices (ViSARD) developed at Technische UniversitĂ€t Ilmenau and conceived for hard real-time tasks requiring floating-point calculations with high precision. Specifically, its arithmetic logic unit is modified to allow exchanging floating-point arithmetic execution units. Design goals of the partial reconfiguration system are high speed, low latency, low resource overhead, and hard real-time capability. They are reached by implementing a custom partial reconfiguration controller loading partial bitstreams from external RAM over a standard AXI bus and extending the ViSARD appropriately. In a test design that switched between 3 different configurations each containing between 1 and 3 execution units, the proposed partial reconfiguration system achieved the maximum specified bitstream throughput on the target FPGA and allowed for roughly 40 % reduced look-up table usage
Number Systems for Deep Neural Network Architectures: A Survey
Deep neural networks (DNNs) have become an enabling component for a myriad of
artificial intelligence applications. DNNs have shown sometimes superior
performance, even compared to humans, in cases such as self-driving, health
applications, etc. Because of their computational complexity, deploying DNNs in
resource-constrained devices still faces many challenges related to computing
complexity, energy efficiency, latency, and cost. To this end, several research
directions are being pursued by both academia and industry to accelerate and
efficiently implement DNNs. One important direction is determining the
appropriate data representation for the massive amount of data involved in DNN
processing. Using conventional number systems has been found to be sub-optimal
for DNNs. Alternatively, a great body of research focuses on exploring suitable
number systems. This article aims to provide a comprehensive survey and
discussion about alternative number systems for more efficient representations
of DNN data. Various number systems (conventional/unconventional) exploited for
DNNs are discussed. The impact of these number systems on the performance and
hardware design of DNNs is considered. In addition, this paper highlights the
challenges associated with each number system and various solutions that are
proposed for addressing them. The reader will be able to understand the
importance of an efficient number system for DNN, learn about the widely used
number systems for DNN, understand the trade-offs between various number
systems, and consider various design aspects that affect the impact of number
systems on DNN performance. In addition, the recent trends and related research
opportunities will be highlightedComment: 28 page
- âŠ