86 research outputs found
High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators
Multiplication is one of the most widely used arithmetic operations in a variety of applications, such as image/video processing and machine learning. FPGA vendors provide high-performance multipliers in the form of DSP blocks. These multipliers are not only limited in number and fixed in location on the FPGA, but can also create additional routing delays and may prove inefficient for smaller bit-width multiplications. Therefore, FPGA vendors additionally provide optimized soft IP cores for multiplication. However, in this work, we argue that these soft multiplier IP cores for FPGAs still need better designs to provide high performance and resource efficiency. Toward this, we present generic area-optimized, low-latency accurate and approximate softcore multiplier architectures, which exploit the underlying architectural features of FPGAs, i.e., lookup table (LUT) structures and fast-carry chains, to reduce the overall critical path delay (CPD) and resource utilization of multipliers. Compared to the Xilinx multiplier LogiCORE IP, our proposed unsigned and signed accurate architectures provide up to 25% and 53% reduction in LUT utilization, respectively, for different sizes of multipliers. Moreover, with our unsigned approximate multiplier architectures, a reduction of up to 51% in the CPD can be achieved with an insignificant loss in output accuracy when compared with the LogiCORE IP. For illustration, we have deployed the proposed multiplier architectures in accelerators used in image and video applications, and evaluated them for area and performance gains. Our library of accurate and approximate multipliers is open-source and available online at https://cfaed.tu-dresden.de/pd-downloads to fuel further research and development in this area, facilitate reproducible research, and thereby enable a new research direction for the FPGA community.
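The abstract does not detail the approximate architectures themselves, but the trade-off it describes can be illustrated with a generic truncation-based approximate multiplier: dropping low-order operand bits shortens the effective multiply at the cost of a bounded accuracy loss. The function name and truncation scheme below are illustrative, not the paper's actual design.

```python
import random

def approx_mul(a: int, b: int, trunc: int = 4) -> int:
    """Approximate unsigned multiply: drop the `trunc` least-significant
    bits of each operand before multiplying, then shift the result back.
    A software sketch of truncation-style approximation, not the paper's
    LUT/carry-chain architecture."""
    return ((a >> trunc) * (b >> trunc)) << (2 * trunc)

# Estimate the mean relative error over random 16-bit operand pairs
random.seed(0)
pairs = [(random.randrange(1, 1 << 16), random.randrange(1, 1 << 16))
         for _ in range(1000)]
err = sum(abs(a * b - approx_mul(a, b)) / (a * b) for a, b in pairs) / len(pairs)
print(f"mean relative error: {err:.4f}")
```

Because truncation only discards bits, the approximate product never exceeds the exact one, which makes the error one-sided and easy to bound analytically.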
Number Systems for Deep Neural Network Architectures: A Survey
Deep neural networks (DNNs) have become an enabling component for a myriad of
artificial intelligence applications. DNNs have shown sometimes superior
performance, even compared to humans, in cases such as self-driving, health
applications, etc. Because of their computational complexity, deploying DNNs in
resource-constrained devices still faces many challenges related to computing
complexity, energy efficiency, latency, and cost. To this end, several research
directions are being pursued by both academia and industry to accelerate and
efficiently implement DNNs. One important direction is determining the
appropriate data representation for the massive amount of data involved in DNN
processing. Using conventional number systems has been found to be sub-optimal
for DNNs. Alternatively, a great body of research focuses on exploring suitable
number systems. This article aims to provide a comprehensive survey and
discussion about alternative number systems for more efficient representations
of DNN data. Various number systems (conventional/unconventional) exploited for
DNNs are discussed. The impact of these number systems on the performance and
hardware design of DNNs is considered. In addition, this paper highlights the
challenges associated with each number system and various solutions that are
proposed for addressing them. The reader will be able to understand the
importance of an efficient number system for DNN, learn about the widely used
number systems for DNN, understand the trade-offs between various number
systems, and consider various design aspects that affect the impact of number
systems on DNN performance. In addition, recent trends and related research
opportunities are highlighted.
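One of the simplest alternatives to floating point that such surveys cover is fixed-point representation, where a real value is stored as an integer with an implicit binary point. A minimal sketch of the quantization round-trip (the Q-format width chosen here is an arbitrary example):

```python
def to_fixed(x: float, frac_bits: int = 8) -> int:
    """Quantize a real value to signed fixed-point with `frac_bits`
    fractional bits (round-to-nearest)."""
    return round(x * (1 << frac_bits))

def from_fixed(q: int, frac_bits: int = 8) -> float:
    """Recover the real value represented by the fixed-point integer."""
    return q / (1 << frac_bits)

w = 0.7071                    # an example DNN weight
q = to_fixed(w)               # integer actually stored in hardware
recovered = from_fixed(q)
print(q, recovered, abs(w - recovered))
```

The round-trip error is bounded by half an LSB (here 1/512), which is the kind of representation/accuracy trade-off that makes number-system choice central to efficient DNN hardware.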
Design and analysis of short word length DSP systems for mobile communication
Recently, many general-purpose DSP applications, such as Least Mean Squares-like single-bit adaptive filter algorithms, have been developed using the Short Word Length (SWL) technique and have been shown to achieve performance similar to multi-bit systems. A key function in SWL systems is sigma-delta modulation (ΣΔM), which operates at an oversampling ratio (OSR), in contrast to the Nyquist-rate sampling typically used in conventional multi-bit systems. To date, the analysis of SWL (or single-bit) DSP systems has tended to be performed using high-level tools such as MATLAB, with little work reported on their hardware implementation, particularly in Field Programmable Gate Arrays (FPGAs). This thesis explores the hardware implementation of single-bit systems in FPGAs, using the design and implementation in VHDL of a single-bit ternary FIR-like filter as an illustrative example. The impact of varying the OSR and bit-width of the SWL filter has been determined, and a comparison undertaken between the area-performance-power characteristics of the SWL FIR filter and its equivalent multi-bit filter. In these experiments, it was found that the single-bit FIR-like filter consistently outperforms the multi-bit technique in terms of area, performance and power, except at the highest filter orders analysed in this work. At higher orders, the ΣΔ approach retains its power and performance advantages but exhibits slightly higher chip area. In the second stage of the thesis, three encoding techniques, canonical signed digit (CSD), 2's complement, and Redundant Binary Signed Digit (RBSD), were designed and investigated on the basis of area and performance in FPGAs at varying OSR. Simulation results show that the CSD encoding technique does not offer any significant improvement over 2's complement, as in the multi-bit domain, whereas RBSD occupies double the chip area of the other two techniques and has poor performance.
The stability of the single-bit FIR-like filter depends mainly on the IIR remodulator, due to its recursive nature. We have therefore investigated the stability of the IIR remodulator and propose a new model, using linear analysis and a root-locus approach, that takes into account the widely accepted second-order sigma-delta modulator state-variable upper bounds. Using the proposed model, we have found new limits on the feedback parameter, a key parameter in single-bit IIR remodulator stability analysis. Further, an analysis of single-bit adaptive channel equalization has been performed in MATLAB, intended to support the design and development of efficient algorithms for single-bit channel equalization. A new mathematical model has been derived with all inputs, coefficients and outputs in the single-bit domain. The model was simulated using narrowband signals in MATLAB and evaluated on the basis of symbol error rate (SER), signal-to-noise ratio (SNR) and minimum mean squared error (MMSE). The results indicate that single-bit adaptive channel equalization is achievable with narrowband signals, but that the harsh quantization noise has a great impact on convergence.
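The core operation of SWL systems, sigma-delta modulation, can be sketched in a few lines: a first-order ΣΔ modulator integrates the error between the input and the previous 1-bit output, so that the local average of the bitstream tracks the input. This is a textbook first-order loop for illustration, not the thesis's ternary or second-order designs.

```python
def sigma_delta_1st(samples):
    """First-order sigma-delta (ΣΔ) modulator: quantize the running
    integrator to a 1-bit output in {-1, +1} and feed the
    quantization error back into the integrator."""
    integ, out = 0.0, []
    for x in samples:
        y = 1.0 if integ >= 0 else -1.0   # 1-bit quantizer
        integ += x - y                    # integrate input minus output
        out.append(y)
    return out

# For a constant input of 0.5, the mean of the 1-bit stream tracks the input
bits = sigma_delta_1st([0.5] * 1000)
print(sum(bits) / len(bits))   # close to 0.5
```

The oversampling ratio matters because the quantization noise is pushed to high frequencies: the longer the window you average (or low-pass filter) over, the closer the decoded value gets to the multi-bit input.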
PoET-BiN: Power Efficient Tiny Binary Neurons
RÉSUMÉ The success of neural networks in image classification has inspired various hardware implementations on embedded systems such as FPGAs, embedded processors and graphics processing units. These systems are often power-constrained. However, neural networks consume a great deal of power through multiply/accumulate operations and the memory accesses needed to fetch weights. Quantization and pruning have been proposed to address this problem. Although effective, these techniques do not take into account the underlying architecture of the hardware used. In this work, we propose a power-efficient, truth-table-based implementation of a binary neuron on resource-constrained embedded systems. A modified decision-tree approach forms the foundation of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent multiply/accumulate operation it replaces. Moreover, the modified decision-tree algorithm eliminates the need for memory accesses. We used the proposed binary neurons to implement the classification layer of networks used on the MNIST, SVHN and CIFAR-10 datasets, with results close to the state of the art. The power reduction for the classification layer reaches three orders of magnitude for the MNIST dataset and five orders of magnitude for the SVHN and CIFAR-10 datasets.----------ABSTRACT The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphical Processing Units.
These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and pruning have been proposed to address this issue. Though effective, these techniques do not take into account the underlying architecture of the embedded hardware. In this work, we propose PoET-BiN, a Look-Up Table based power-efficient implementation on resource-constrained embedded devices. A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses. We applied the PoET-BiN architecture to implement the classification layers of networks trained on the MNIST, SVHN and CIFAR-10 datasets, with near state-of-the-art results. The energy reduction for the classifier portion reaches up to six orders of magnitude compared to a floating-point implementation and up to three orders of magnitude compared to recent binary quantized neural networks.
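The key idea, a binary neuron evaluated by table lookup instead of multiply-accumulates, can be sketched directly: with inputs and weights in {-1, +1}, the neuron's output is a pure function of the input bit pattern, so it can be precomputed once into a truth table. This illustrates the LUT substitution in general; PoET-BiN's modified decision trees additionally compress such tables, which this sketch does not attempt.

```python
from itertools import product

def build_lut(weights, threshold=0):
    """Precompute a binary neuron sign(w . x) > threshold as a truth
    table indexed by the input bit pattern (bit 1 maps to +1, bit 0
    to -1). Evaluation then needs no multiply-accumulate at all."""
    n = len(weights)
    lut = {}
    for bits in product([0, 1], repeat=n):
        x = [1 if b else -1 for b in bits]
        lut[bits] = int(sum(w * xi for w, xi in zip(weights, x)) > threshold)
    return lut

lut = build_lut([+1, -1, +1, +1])   # illustrative 4-input neuron
# inference is now a single lookup
print(lut[(1, 0, 1, 1)])            # input pattern matches the weights: fires
```

The table has 2^n entries, which is exactly why this maps well onto FPGA LUT primitives for small fan-in, and why larger neurons need the decomposition the paper proposes.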
On the Exploration of FPGAs and High-Level Synthesis Capabilities on Multi-Gigabit-per-Second Networks
Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Defense date: 24-01-2020. Traffic on computer networks has grown exponentially in recent years.
Both links and communication equipment have had to adapt in order to provide the minimum quality of service required for current needs. However, in recent years, a few factors have prevented commercial off-the-shelf hardware from keeping pace with this growth rate; consequently, some software tools are struggling to fulfill their tasks, especially at speeds higher than 10 Gbit/s. For this reason, Field Programmable Gate Arrays (FPGAs) have arisen as an alternative for addressing the most demanding tasks without the need to design an application-specific integrated circuit, thanks in part to their flexibility and programmability in the field. Needless to say, developing for FPGAs is well known to be complex. Therefore, in this thesis we tackle the use of FPGAs and High-Level Synthesis (HLS) languages in the context of computer networks. We focus on the use of FPGAs both in computer network monitoring applications and in reliable data transmission at very high speed. We also intend to shed light on the use of high-level synthesis languages and to boost FPGA applicability in the context of computer networks, so as to reduce development time and design complexity.
The first part of the thesis is devoted to computer network monitoring. We take advantage of FPGA determinism to implement active monitoring probes, which consist of sending a train of packets that is later used to obtain network parameters. In this case, determinism is key to reducing measurement uncertainty. The results of our experiments show that the FPGA implementations are considerably more accurate and precise than their software counterparts. At the same time, the FPGA implementation scales with network speed (1, 10 and 100 Gbit/s). In the context of passive monitoring, we leverage the FPGA architecture to implement algorithms able to thin encrypted traffic and to remove duplicate packets. These two algorithms are straightforward in principle, but very useful in helping traditional network analysis tools cope with their tasks at higher network speeds: processing encrypted traffic brings little benefit, while processing duplicate traffic negatively impacts the performance of software tools.
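The duplicate-removal idea can be sketched in software as a sliding-window filter over packet digests: hash each packet and drop it if the same digest was seen among the last N packets. The window size and hash choice here are illustrative; the thesis implements the equivalent logic in FPGA hardware at line rate.

```python
import hashlib
from collections import OrderedDict

def dedup(packets, window=1024):
    """Drop packets whose content digest was seen within the last
    `window` distinct packets -- a sliding-window duplicate filter."""
    seen = OrderedDict()
    out = []
    for pkt in packets:
        h = hashlib.sha1(pkt).digest()
        if h in seen:
            continue                  # duplicate within the window: drop
        seen[h] = None
        if len(seen) > window:
            seen.popitem(last=False)  # evict the oldest digest
        out.append(pkt)
    return out

stream = [b"A", b"B", b"A", b"C", b"B"]
print(dedup(stream))   # duplicates of A and B removed
```

Bounding the window is what makes a hardware realization feasible: the state fits in on-chip memory and each packet costs one hash plus one table lookup.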
The second part of the thesis is devoted to the TCP/IP stack. We explore the current limitations of reliable data transmission at very high speed using standard software. Nowadays, the network is becoming an important bottleneck for current needs, particularly in data centers. Moreover, the deployment of 100 Gbit/s network links has started in recent years. Consequently, there has been increased scrutiny of how networking functionality is deployed, and a wide range of approaches are currently being explored to increase the efficiency of networks and tailor their functionality to the actual needs of the application at hand. FPGAs arise as a strong alternative for dealing with this problem. For this reason, in this thesis we develop Limago, an FPGA-based open-source implementation of a TCP/IP stack operating at 100 Gbit/s for Xilinx FPGAs. Limago not only provides unprecedented throughput, but also a latency at least fifteen times lower than software implementations. Limago is a key contribution to some of the hottest topics at the moment, such as network-attached FPGAs and in-network data processing.
Hardware support for real-time network security and packet classification using field programmable gate arrays
Deep packet inspection and packet classification are the most computationally expensive operations in a Network Intrusion Detection (NID) system. Deep packet inspection involves content matching where the payload of the incoming packets is matched against a set of signatures in the database. Packet classification involves inspection of the packet header fields and is basically a multi-dimensional matching problem. Any matching in software is very slow in comparison to current network speeds. Also, both of these problems need a solution which is scalable and can work at high speeds. Due to the high complexity of these matching problems, only Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC) platforms can facilitate efficient designs.
Two novel FPGA-based NID solutions were developed and implemented that not only carry out pattern matching at high speed but also allow changes to the set of stored patterns without resource/hardware reconfiguration; to their advantage, the solutions can easily be adopted by software or ASIC approaches as well. In both solutions, the proposed NID system can run while pattern updates occur. The designs can operate at 2.4 Gbps line rates, and have a memory consumption of around 17 bits per character and a logic cell usage of around 0.05 logic cells per character, which are the smallest compared to any other existing FPGA-based solution.
In addition to these pattern-matching solutions, a novel packet classification algorithm was developed and implemented on an FPGA. The method matches two header fields at a time and then combines the constituent results to identify longer matches involving more header fields. The design can achieve a throughput larger than 9.72 Gbps and has an on-chip memory consumption of around 256 KB when dealing with more than 10,000 rules (without using external RAM). This memory consumption is the lowest among all previously proposed FPGA-based designs for packet classification.
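The two-fields-at-a-time strategy can be sketched as follows: each pair of header fields has its own match table mapping field values to the set of rule IDs consistent with that pair, and intersecting the per-pair sets yields the rules that match on all fields. The field names, rule IDs, and exact-match tables below are hypothetical, for illustration only (the actual design supports prefix/range matching in hardware).

```python
def classify(pkt, pair_tables):
    """Match the packet two header fields at a time, then intersect the
    per-pair rule-ID sets to find rules matching on all fields."""
    matched = None
    for (f1, f2), table in pair_tables.items():
        ids = table.get((pkt[f1], pkt[f2]), set())
        matched = ids if matched is None else matched & ids
    return matched or set()

# Hypothetical 4-field rule set split into two 2-field tables:
# rule 1 = (10.0.0.1, 10.0.0.2, 1234, 80), rule 2 = (..., 1234, 443)
pair_tables = {
    ("src", "dst"): {("10.0.0.1", "10.0.0.2"): {1, 2}},
    ("sport", "dport"): {(1234, 80): {1}, (1234, 443): {2}},
}
pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80}
print(classify(pkt, pair_tables))   # {1}
```

Decomposing a d-field rule into d/2 two-field tables keeps each table small enough for on-chip memory, at the cost of the set intersection, which is cheap to do in parallel hardware.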