86 research outputs found
High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators
Multiplication is one of the most widely used arithmetic operations in a variety of applications, such as image/video processing and machine learning. FPGA vendors provide high-performance multipliers in the form of DSP blocks. These multipliers are not only limited in number and fixed in location on the FPGA, but can also create additional routing delays and may prove inefficient for smaller bit-width multiplications. Therefore, FPGA vendors additionally provide optimized soft IP cores for multiplication. However, in this work, we argue that these soft multiplier IP cores for FPGAs still need better designs to provide high performance and resource efficiency. Toward this, we present generic area-optimized, low-latency accurate and approximate softcore multiplier architectures, which exploit the underlying architectural features of FPGAs, i.e., lookup table (LUT) structures and fast-carry chains, to reduce the overall critical path delay (CPD) and resource utilization of multipliers. Compared to the Xilinx multiplier LogiCORE IP, our proposed unsigned and signed accurate architectures provide up to 25% and 53% reduction in LUT utilization, respectively, for different sizes of multipliers. Moreover, with our unsigned approximate multiplier architectures, a reduction of up to 51% in the CPD can be achieved with an insignificant loss in output accuracy when compared with the LogiCORE IP. For illustration, we have deployed the proposed multiplier architectures in accelerators used in image and video applications, and evaluated them for area and performance gains. Our library of accurate and approximate multipliers is open-source and available online at https://cfaed.tu-dresden.de/pd-downloads to fuel further research and development in this area, facilitate reproducible research, and thereby enable a new research direction for the FPGA community.
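The abstract does not detail the approximate architectures themselves, but the trade-off it describes can be illustrated with a generic truncation-based approximate multiplier: dropping low-order operand bits shortens the effective multiply at the cost of a bounded accuracy loss. The function name and truncation scheme below are illustrative, not the paper's actual design.

```python
import random

def approx_mul(a: int, b: int, trunc: int = 4) -> int:
    """Approximate unsigned multiply: drop the `trunc` least-significant
    bits of each operand before multiplying, then shift the result back.
    A software sketch of truncation-style approximation, not the paper's
    LUT/carry-chain architecture."""
    return ((a >> trunc) * (b >> trunc)) << (2 * trunc)

# Estimate the mean relative error over random 16-bit operand pairs
random.seed(0)
pairs = [(random.randrange(1, 1 << 16), random.randrange(1, 1 << 16))
         for _ in range(1000)]
err = sum(abs(a * b - approx_mul(a, b)) / (a * b) for a, b in pairs) / len(pairs)
print(f"mean relative error: {err:.4f}")
```

Because truncation only discards bits, the approximate product never exceeds the exact one, which makes the error one-sided and easy to bound analytically.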
Number Systems for Deep Neural Network Architectures: A Survey
Deep neural networks (DNNs) have become an enabling component for a myriad of
artificial intelligence applications. DNNs have shown sometimes superior
performance, even compared to humans, in cases such as self-driving, health
applications, etc. Because of their computational complexity, deploying DNNs in
resource-constrained devices still faces many challenges related to computing
complexity, energy efficiency, latency, and cost. To this end, several research
directions are being pursued by both academia and industry to accelerate and
efficiently implement DNNs. One important direction is determining the
appropriate data representation for the massive amount of data involved in DNN
processing. Using conventional number systems has been found to be sub-optimal
for DNNs. Alternatively, a great body of research focuses on exploring suitable
number systems. This article aims to provide a comprehensive survey and
discussion about alternative number systems for more efficient representations
of DNN data. Various number systems (conventional/unconventional) exploited for
DNNs are discussed. The impact of these number systems on the performance and
hardware design of DNNs is considered. In addition, this paper highlights the
challenges associated with each number system and various solutions that are
proposed for addressing them. The reader will be able to understand the
importance of an efficient number system for DNN, learn about the widely used
number systems for DNN, understand the trade-offs between various number
systems, and consider various design aspects that affect the impact of number
systems on DNN performance. In addition, recent trends and related research
opportunities are highlighted.
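One of the simplest alternatives to floating point that such surveys cover is fixed-point representation, where a real value is stored as an integer with an implicit binary point. A minimal sketch of the quantization round-trip (the Q-format width chosen here is an arbitrary example):

```python
def to_fixed(x: float, frac_bits: int = 8) -> int:
    """Quantize a real value to signed fixed-point with `frac_bits`
    fractional bits (round-to-nearest)."""
    return round(x * (1 << frac_bits))

def from_fixed(q: int, frac_bits: int = 8) -> float:
    """Recover the real value represented by the fixed-point integer."""
    return q / (1 << frac_bits)

w = 0.7071                    # an example DNN weight
q = to_fixed(w)               # integer actually stored in hardware
recovered = from_fixed(q)
print(q, recovered, abs(w - recovered))
```

The round-trip error is bounded by half an LSB (here 1/512), which is the kind of representation/accuracy trade-off that makes number-system choice central to efficient DNN hardware.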
Design and analysis of short word length DSP systems for mobile communication
Recently, many general-purpose DSP applications, such as Least Mean Squares-like single-bit adaptive filter algorithms, have been developed using the Short Word Length (SWL) technique and have been shown to achieve performance similar to multi-bit systems. A key function in SWL systems is sigma-delta modulation (ΣΔM), which operates at an oversampling ratio (OSR), in contrast to the Nyquist-rate sampling typically used in conventional multi-bit systems. To date, the analysis of SWL (or single-bit) DSP systems has tended to be performed using high-level tools such as MATLAB, with little work reported on their hardware implementation, particularly in Field Programmable Gate Arrays (FPGAs). This thesis explores the hardware implementation of single-bit systems in FPGAs, using the design and implementation in VHDL of a single-bit ternary FIR-like filter as an illustrative example. The impact of varying the OSR and bit-width of the SWL filter has been determined, and a comparison undertaken between the area-performance-power characteristics of the SWL FIR filter and its equivalent multi-bit filter. In these experiments, it was found that the single-bit FIR-like filter consistently outperforms the multi-bit technique in terms of area, performance and power, except at the highest filter orders analysed in this work. At higher orders, the ΣΔ approach retains its power and performance advantages but exhibits slightly higher chip area. In the second stage of the thesis, three encoding techniques, canonical signed digit (CSD), 2's complement, and Redundant Binary Signed Digit (RBSD), were designed and investigated on the basis of area and performance in FPGAs at varying OSR. Simulation results show that the CSD encoding technique does not offer any significant improvement over 2's complement, as in the multi-bit domain, whereas RBSD occupies double the chip area of the other two techniques and has poor performance.
The stability of the single-bit FIR-like filter depends mainly on the IIR remodulator, due to its recursive nature. We have therefore investigated the stability of the IIR remodulator and propose a new model, using linear analysis and a root-locus approach, that takes into account the widely accepted second-order sigma-delta modulator state-variable upper bounds. Using the proposed model, we have found new limits on the feedback parameter, a key parameter in single-bit IIR remodulator stability analysis. Further, an analysis of single-bit adaptive channel equalization has been performed in MATLAB, intended to support the design and development of efficient algorithms for single-bit channel equalization. A new mathematical model has been derived with all inputs, coefficients and outputs in the single-bit domain. The model was simulated using narrowband signals in MATLAB and evaluated on the basis of symbol error rate (SER), signal-to-noise ratio (SNR) and minimum mean squared error (MMSE). The results indicate that single-bit adaptive channel equalization is achievable with narrowband signals, but that the harsh quantization noise has a great impact on convergence.
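The core operation of SWL systems, sigma-delta modulation, can be sketched in a few lines: a first-order ΣΔ modulator integrates the error between the input and the previous 1-bit output, so that the local average of the bitstream tracks the input. This is a textbook first-order loop for illustration, not the thesis's ternary or second-order designs.

```python
def sigma_delta_1st(samples):
    """First-order sigma-delta (ΣΔ) modulator: quantize the running
    integrator to a 1-bit output in {-1, +1} and feed the
    quantization error back into the integrator."""
    integ, out = 0.0, []
    for x in samples:
        y = 1.0 if integ >= 0 else -1.0   # 1-bit quantizer
        integ += x - y                    # integrate input minus output
        out.append(y)
    return out

# For a constant input of 0.5, the mean of the 1-bit stream tracks the input
bits = sigma_delta_1st([0.5] * 1000)
print(sum(bits) / len(bits))   # close to 0.5
```

The oversampling ratio matters because the quantization noise is pushed to high frequencies: the longer the window you average (or low-pass filter) over, the closer the decoded value gets to the multi-bit input.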
PoET-BiN: Power Efficient Tiny Binary Neurons
RÉSUMÉ The success of neural networks in image classification has inspired various hardware implementations on embedded systems such as FPGAs, embedded processors and graphics processing units. These systems are often power-constrained. However, neural networks consume a great deal of power through multiply/accumulate operations and the memory accesses needed to fetch weights. Quantization and pruning have been proposed to address this problem. Although effective, these techniques do not take into account the underlying architecture of the hardware used. In this work, we propose a power-efficient, truth-table-based implementation of a binary neuron on resource-constrained embedded systems. A modified decision-tree approach forms the foundation of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent multiply/accumulate operation it replaces. Moreover, the modified decision-tree algorithm eliminates the need for memory accesses. We used the proposed binary neurons to implement the classification layer of networks used on the MNIST, SVHN and CIFAR-10 datasets, with results close to the state of the art. The power reduction for the classification layer reaches three orders of magnitude for the MNIST dataset and five orders of magnitude for the SVHN and CIFAR-10 datasets.----------ABSTRACT The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphical Processing Units.
These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and pruning have been proposed to address this issue. Though effective, these techniques do not take into account the underlying architecture of the embedded hardware. In this work, we propose PoET-BiN, a Look-Up Table based power-efficient implementation on resource-constrained embedded devices. A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses. We applied the PoET-BiN architecture to implement the classification layers of networks trained on the MNIST, SVHN and CIFAR-10 datasets, with near state-of-the-art results. The energy reduction for the classifier portion reaches up to six orders of magnitude compared to a floating-point implementation and up to three orders of magnitude compared to recent binary quantized neural networks.
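The key idea, a binary neuron evaluated by table lookup instead of multiply-accumulates, can be sketched directly: with inputs and weights in {-1, +1}, the neuron's output is a pure function of the input bit pattern, so it can be precomputed once into a truth table. This illustrates the LUT substitution in general; PoET-BiN's modified decision trees additionally compress such tables, which this sketch does not attempt.

```python
from itertools import product

def build_lut(weights, threshold=0):
    """Precompute a binary neuron sign(w . x) > threshold as a truth
    table indexed by the input bit pattern (bit 1 maps to +1, bit 0
    to -1). Evaluation then needs no multiply-accumulate at all."""
    n = len(weights)
    lut = {}
    for bits in product([0, 1], repeat=n):
        x = [1 if b else -1 for b in bits]
        lut[bits] = int(sum(w * xi for w, xi in zip(weights, x)) > threshold)
    return lut

lut = build_lut([+1, -1, +1, +1])   # illustrative 4-input neuron
# inference is now a single lookup
print(lut[(1, 0, 1, 1)])            # input pattern matches the weights: fires
```

The table has 2^n entries, which is exactly why this maps well onto FPGA LUT primitives for small fan-in, and why larger neurons need the decomposition the paper proposes.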
On the Exploration of FPGAs and High-Level Synthesis Capabilities on Multi-Gigabit-per-Second Networks
Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Defense date: 24-01-2020. Traffic on computer networks has grown exponentially in recent years.
Both links and communication equipment have had to adapt in order to provide the minimum quality of service required for current needs. However, in recent years, a few factors have prevented commercial off-the-shelf hardware from keeping pace with this growth rate; consequently, some software tools are struggling to fulfill their tasks, especially at speeds higher than 10 Gbit/s. For this reason, Field Programmable Gate Arrays (FPGAs) have arisen as an alternative for addressing the most demanding tasks without the need to design an application-specific integrated circuit, thanks in part to their flexibility and programmability in the field. Needless to say, developing for FPGAs is well known to be complex. Therefore, in this thesis we tackle the use of FPGAs and High-Level Synthesis (HLS) languages in the context of computer networks. We focus on the use of FPGAs both in computer network monitoring applications and in reliable data transmission at very high speed. We also intend to shed light on the use of high-level synthesis languages and to boost FPGA applicability in the context of computer networks, so as to reduce development time and design complexity.
The first part of the thesis is devoted to computer network monitoring. We take advantage of FPGA determinism to implement active monitoring probes, which consist of sending a train of packets that is later used to obtain network parameters. In this case, determinism is key to reducing measurement uncertainty. The results of our experiments show that the FPGA implementations are considerably more accurate and precise than their software counterparts. At the same time, the FPGA implementation scales with network speed (1, 10 and 100 Gbit/s). In the context of passive monitoring, we leverage the FPGA architecture to implement algorithms able to thin encrypted traffic and to remove duplicate packets. These two algorithms are straightforward in principle, but very useful in helping traditional network analysis tools cope with their tasks at higher network speeds: processing encrypted traffic brings little benefit, while processing duplicate traffic negatively impacts the performance of software tools.
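The duplicate-removal idea can be sketched in software as a sliding-window filter over packet digests: hash each packet and drop it if the same digest was seen among the last N packets. The window size and hash choice here are illustrative; the thesis implements the equivalent logic in FPGA hardware at line rate.

```python
import hashlib
from collections import OrderedDict

def dedup(packets, window=1024):
    """Drop packets whose content digest was seen within the last
    `window` distinct packets -- a sliding-window duplicate filter."""
    seen = OrderedDict()
    out = []
    for pkt in packets:
        h = hashlib.sha1(pkt).digest()
        if h in seen:
            continue                  # duplicate within the window: drop
        seen[h] = None
        if len(seen) > window:
            seen.popitem(last=False)  # evict the oldest digest
        out.append(pkt)
    return out

stream = [b"A", b"B", b"A", b"C", b"B"]
print(dedup(stream))   # duplicates of A and B removed
```

Bounding the window is what makes a hardware realization feasible: the state fits in on-chip memory and each packet costs one hash plus one table lookup.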
The second part of the thesis is devoted to the TCP/IP stack. We explore the current limitations of reliable data transmission at very high speed using standard software. Nowadays, the network is becoming an important bottleneck for current needs, particularly in data centers. Moreover, the deployment of 100 Gbit/s network links has started in recent years. Consequently, there has been increased scrutiny of how networking functionality is deployed, and a wide range of approaches are currently being explored to increase the efficiency of networks and tailor their functionality to the actual needs of the application at hand. FPGAs arise as a strong alternative for dealing with this problem. For this reason, in this thesis we develop Limago, an FPGA-based open-source implementation of a TCP/IP stack operating at 100 Gbit/s for Xilinx FPGAs. Limago not only provides unprecedented throughput, but also a latency at least fifteen times lower than software implementations. Limago is a key contribution to some of the hottest topics at the moment, such as network-attached FPGAs and in-network data processing.
Hardware support for real-time network security and packet classification using field programmable gate arrays
Deep packet inspection and packet classification are the most computationally expensive operations in a Network Intrusion Detection (NID) system. Deep packet inspection involves content matching where the payload of the incoming packets is matched against a set of signatures in the database. Packet classification involves inspection of the packet header fields and is basically a multi-dimensional matching problem. Any matching in software is very slow in comparison to current network speeds. Also, both of these problems need a solution which is scalable and can work at high speeds. Due to the high complexity of these matching problems, only Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC) platforms can facilitate efficient designs.
Two novel FPGA-based NID solutions were developed and implemented that not only carry out pattern matching at high speed but also allow changes to the set of stored patterns without resource/hardware reconfiguration; to their advantage, the solutions can easily be adopted by software or ASIC approaches as well. In both solutions, the proposed NID system can run while pattern updates occur. The designs can operate at 2.4 Gbps line rates, and have a memory consumption of around 17 bits per character and a logic cell usage of around 0.05 logic cells per character, which are the smallest compared to any other existing FPGA-based solution.
In addition to these pattern-matching solutions, a novel packet classification algorithm was developed and implemented on an FPGA. The method matches two header fields at a time and then combines the constituent results to identify longer matches involving more header fields. The design can achieve a throughput larger than 9.72 Gbps and has an on-chip memory consumption of around 256 KB when dealing with more than 10,000 rules (without using external RAM). This memory consumption is the lowest among all previously proposed FPGA-based designs for packet classification.
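The two-fields-at-a-time strategy can be sketched as follows: each pair of header fields has its own match table mapping field values to the set of rule IDs consistent with that pair, and intersecting the per-pair sets yields the rules that match on all fields. The field names, rule IDs, and exact-match tables below are hypothetical, for illustration only (the actual design supports prefix/range matching in hardware).

```python
def classify(pkt, pair_tables):
    """Match the packet two header fields at a time, then intersect the
    per-pair rule-ID sets to find rules matching on all fields."""
    matched = None
    for (f1, f2), table in pair_tables.items():
        ids = table.get((pkt[f1], pkt[f2]), set())
        matched = ids if matched is None else matched & ids
    return matched or set()

# Hypothetical 4-field rule set split into two 2-field tables:
# rule 1 = (10.0.0.1, 10.0.0.2, 1234, 80), rule 2 = (..., 1234, 443)
pair_tables = {
    ("src", "dst"): {("10.0.0.1", "10.0.0.2"): {1, 2}},
    ("sport", "dport"): {(1234, 80): {1}, (1234, 443): {2}},
}
pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80}
print(classify(pkt, pair_tables))   # {1}
```

Decomposing a d-field rule into d/2 two-field tables keeps each table small enough for on-chip memory, at the cost of the set intersection, which is cheap to do in parallel hardware.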