PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference
Field-programmable gate arrays (FPGAs) are widely used to implement deep
learning inference. Standard deep neural network inference involves the
computation of interleaved linear maps and nonlinear activation functions.
Prior work for ultra-low latency implementations has hardcoded the combination
of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our
work is motivated by the idea that the LUTs in an FPGA can be used to implement
a much greater variety of functions than this. In this paper, we propose a
novel approach to training neural networks for FPGA deployment using
multivariate polynomials as the basic building block. Our method takes
advantage of the flexibility offered by the soft logic, hiding the polynomial
evaluation inside the LUTs with zero overhead. We show that by using polynomial
building blocks, we can achieve the same accuracy using considerably fewer
layers of soft logic than by using linear functions, leading to significant
latency and area improvements. We demonstrate the effectiveness of this
approach in three tasks: network intrusion detection, jet identification at the
CERN Large Hadron Collider, and handwritten digit recognition using the MNIST
dataset.
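To make the idea concrete, here is a minimal Python sketch of a polynomial LUT neuron: because each neuron sees only a few low-precision inputs, its multivariate polynomial can be exhaustively enumerated into a truth table, which is exactly what an FPGA LUT stores. The coefficients, monomials, and sign activation below are illustrative assumptions, not the paper's trained parameters.

```python
import itertools
import numpy as np

def poly_neuron_lut(coeffs, monomials, n_inputs, in_bits):
    """Enumerate a multivariate-polynomial neuron as a lookup table.

    coeffs[i] multiplies the monomial whose per-input exponents are
    monomials[i]; inputs are unsigned fixed-point values of `in_bits`
    bits each. Because the input space is tiny (2**(n_inputs*in_bits)
    entries), polynomial evaluation is absorbed into the LUT for free.
    """
    levels = 2 ** in_bits
    table = {}
    for xs in itertools.product(range(levels), repeat=n_inputs):
        acc = 0.0
        for c, exps in zip(coeffs, monomials):
            term = c
            for x, e in zip(xs, exps):
                term *= x ** e
            acc += term
        table[xs] = np.sign(acc)  # toy binary activation
    return table

# Degree-2 polynomial of two 2-bit inputs: f(x0, x1) = 1 + x0*x1 - 0.5*x1**2
lut = poly_neuron_lut(
    coeffs=[1.0, 1.0, -0.5],
    monomials=[(0, 0), (1, 1), (0, 2)],
    n_inputs=2, in_bits=2,
)
print(lut[(3, 1)])
```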
Hardware and algorithm architectures for real-time additive synthesis
Additive synthesis is a fundamental computer music synthesis paradigm tracing its origins to the work of Fourier and Helmholtz. A rudimentary implementation linearly combines harmonic sinusoids (or partials) to generate tones whose perceived timbral characteristics are a strong function of the partial amplitude spectrum. Having evolved over time, additive synthesis now describes a collection of algorithms, each characterised by the time-varying linear combination of basis components to generate temporal evolution of timbre. Basis components include exactly harmonic partials, inharmonic partials with time-varying frequency, or non-sinusoidal waveforms, each with distinct spectral characteristics. Additive synthesis of polyphonic musical instrument tones requires a large number of independently controlled partials, incurring a large computational overhead whose investigation and reduction are key motivators for this work. The thesis begins with a review of prevalent synthesis techniques, setting additive synthesis in context and introducing the spectrum modelling paradigm, which provides baseline spectral data to the additive synthesis process obtained from the analysis of natural sounds. We proceed to investigate recursive and phase-accumulating digital sinusoidal oscillator algorithms, defining specific metrics to quantify relative performance. The concepts of phase accumulation, table-lookup phase-amplitude mapping and interpolated fractional addressing are introduced, developed, and shown to underpin an additive synthesis subclass: wavetable lookup synthesis (WLS). WLS performance is simulated against specific metrics and parameter conditions peculiar to computer music requirements. We conclude by presenting processing architectures which accelerate the computational throughput of specific WLS operations and the sinusoidal additive synthesis model. In particular, we introduce and investigate the concept of phase-domain processing and present several "pipeline friendly" arithmetic architectures using this technique which implement the additive synthesis of sinusoidal partials.
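As an illustration of the WLS core described above, the following Python sketch implements a phase-accumulating oscillator with table lookup and interpolated fractional addressing. The normalised floating-point phase stands in for the fixed-point accumulator a hardware implementation would use, and the table size and tone parameters are arbitrary.

```python
import numpy as np

def wavetable_oscillator(freq_hz, sample_rate, table, n_samples):
    """Phase-accumulating wavetable lookup with linear interpolation.

    A phase accumulator advances by an increment proportional to the
    desired frequency; the integer part of the scaled phase addresses
    the wavetable and the fractional part drives linear interpolation
    between neighbouring entries.
    """
    table_len = len(table)
    phase = 0.0                    # normalised phase in [0, 1)
    inc = freq_hz / sample_rate    # phase increment per sample
    out = np.empty(n_samples)
    for n in range(n_samples):
        pos = phase * table_len
        i = int(pos)
        frac = pos - i
        out[n] = (1.0 - frac) * table[i] + frac * table[(i + 1) % table_len]
        phase = (phase + inc) % 1.0
    return out

# One-cycle sine wavetable; a 440 Hz tone at 48 kHz
table = np.sin(2 * np.pi * np.arange(1024) / 1024)
tone = wavetable_oscillator(440.0, 48000.0, table, 480)
```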
Multipartite table methods
A unified view of most previous table-lookup-and-addition methods (bipartite tables, SBTM, STAM, and multipartite methods) is presented. This unified view allows a more accurate computation of the error entailed by these methods, which enables a wider design-space exploration, leading to tables smaller than the best previously published ones by up to 50 percent. The synthesis of these multipartite architectures on Virtex FPGAs is also discussed. Compared to other methods involving multipliers, the multipartite approach offers the best speed/area tradeoff for precisions up to 16 bits. A reference implementation is available at www.ens-lyon.fr/LIP/Arenaire/
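The bipartite special case of these table-lookup-and-addition methods can be sketched in a few lines of Python: the input splits into three bit fields, a table of initial values is indexed by the two leading fields, and a table of first-order offsets by the outer two, so a single addition replaces one exponentially larger table. The segment-midpoint sampling and forward-difference slope below are simplifying assumptions; the paper's multipartite decompositions and error analysis are considerably more refined.

```python
import numpy as np

def bipartite_tables(f, n0, n1, n2):
    """Build bipartite tables for f on [0, 1): f(x) ~= TIV[a,b] + TO[a,c].

    The (n0+n1+n2)-bit input splits as fields a.b.c. TIV stores f at the
    midpoint of each (a, b) segment; TO stores the first-order correction
    from the trailing field, using a slope that varies slowly in b.
    """
    s0, s1, s2 = 2**-n0, 2**-(n0 + n1), 2**-(n0 + n1 + n2)
    tiv = np.zeros((2**n0, 2**n1))
    to = np.zeros((2**n0, 2**n2))
    for a in range(2**n0):
        xa = a * s0
        for b in range(2**n1):
            tiv[a, b] = f(xa + b * s1 + s1 / 2)        # segment midpoint
        slope = (f(xa + s0 * 0.999) - f(xa)) / (s0 * 0.999)
        for c in range(2**n2):
            to[a, c] = (c * s2 - (2**n2 - 1) * s2 / 2) * slope
    return tiv, to

def bipartite_eval(tiv, to, a, b, c):
    """One add instead of one huge table lookup."""
    return tiv[a, b] + to[a, c]

tiv, to = bipartite_tables(np.exp, 4, 3, 3)   # 10-bit input, two small tables
print(bipartite_eval(tiv, to, 8, 4, 2))
```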
Finite element analysis of small-scale hot compression testing
This paper models hot compression testing using a dilatometer in loading mode. These small-scale tests provide a high throughput at low cost, but are susceptible to inhomogeneity due to friction and temperature gradients. A novel method is presented for correcting the true stress-strain constitutive response over the full range of temperatures, strain-rates and strain. The nominal response from the tests is used to predict the offset in the stress-strain curves due to inhomogeneity, and this stress offset Δσ is applied piecewise to the data, correcting the constitutive response in one iteration. A key new feature is the smoothing and fitting of the flow stress data as a function of temperature and strain-rate, at multiple discrete strains. The corrected model then provides quantitative prediction of the spatial and temporal variation in strain-rate and strain throughout the sample, needed to correlate the local deformation conditions with the microstructure and texture evolution. The study uses a detailed series of 144 hot compression tests of a Zr-Nb alloy. While this is an important wrought nuclear alloy in its own right, it also serves here as a test case for modelling the dilatometer for hot testing of high temperature alloys, particularly those with dual α-β phase microstructures (such as titanium alloys).
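A rough Python sketch of the correction step described above might look as follows. The dictionary mapping discrete strains to FE-predicted offsets, the polynomial form of the smoothing surface, and all numbers are hypothetical stand-ins for the paper's actual FE output and fitting procedure.

```python
import numpy as np

def correct_flow_stress(strain, stress_nominal, delta_sigma):
    """Apply the FE-predicted inhomogeneity offset piecewise in strain.

    `delta_sigma` maps discrete strain levels to the stress offset
    attributed to friction and thermal gradients; interpolating it onto
    the measured strain axis and subtracting corrects the curve in a
    single iteration.
    """
    eps_pts = np.array(sorted(delta_sigma))
    dsig_pts = np.array([delta_sigma[e] for e in eps_pts])
    return stress_nominal - np.interp(strain, eps_pts, dsig_pts)

def smooth_flow_stress(T, log_rate, sigma, deg=2):
    """Least-squares polynomial surface sigma(T, log strain-rate) at one
    discrete strain: the smoothing-and-fitting step of the method."""
    A = np.column_stack([T**i * log_rate**j
                         for i in range(deg + 1)
                         for j in range(deg + 1 - i)])
    coeffs, *_ = np.linalg.lstsq(A, sigma, rcond=None)
    return coeffs

# Toy usage with made-up numbers
eps = np.linspace(0.0, 0.5, 6)
sig = 100 + 200 * eps
corrected = correct_flow_stress(eps, sig, {0.1: 5.0, 0.3: 9.0, 0.5: 12.0})
T = np.array([800.0, 850.0, 900.0, 800.0, 850.0, 900.0])
rate = np.log10([0.1, 0.1, 0.1, 1.0, 1.0, 1.0])
coeffs = smooth_flow_stress(T, rate, np.array([120.0, 100.0, 85.0, 140.0, 118.0, 99.0]))
```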
Architecture-Preserving Provable Repair of Deep Neural Networks
Deep neural networks (DNNs) are becoming increasingly important components of
software, and are considered the state-of-the-art solution for a number of
problems, such as image recognition. However, DNNs are far from infallible, and
incorrect behavior of DNNs can have disastrous real-world consequences. This
paper addresses the problem of architecture-preserving V-polytope provable
repair of DNNs. A V-polytope defines a convex bounded polytope using its vertex
representation. V-polytope provable repair guarantees that the repaired DNN
satisfies the given specification on the infinite set of points in the given
V-polytope. An architecture-preserving repair only modifies the parameters of
the DNN, without modifying its architecture. The repair has the flexibility to
modify multiple layers of the DNN, and runs in polynomial time. It supports
DNNs with activation functions that have some linear pieces, as well as
fully-connected, convolutional, pooling, and residual layers. To the best of our
knowledge, this is the first provable repair approach that has all of these
features. We implement our approach in a tool called APRNN. Using MNIST,
ImageNet, and ACAS Xu DNNs, we show that it has better efficiency, scalability,
and generalization compared to PRDNN and REASSURE, prior provable repair
methods that are not architecture preserving.
Comment: Accepted paper at PLDI 2023. The tool is available at https://github.com/95616ARG/APRNN
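The convexity argument that makes vertex-based checking sound can be sketched in a few lines of Python: if the repaired network restricted to the polytope is affine and the output specification is a halfspace, checking the finitely many vertices certifies the infinite point set. This shows only the underlying principle, not the APRNN repair algorithm itself; the weights, polytope, and specification below are toy values.

```python
import numpy as np

def affine_on_vertices_certifies(W, b, vertices, spec):
    """Vertex check behind V-polytope provable repair (illustrative).

    If the network restricted to the polytope is the affine map
    x -> W @ x + b and `spec` is a halfspace constraint c.y <= d on the
    output, then satisfying it at every vertex certifies it everywhere
    in the polytope: the affine image of a convex hull is the convex
    hull of the vertex images, and halfspaces are closed under
    convex combination.
    """
    c, d = spec
    return all(c @ (W @ v + b) <= d for v in vertices)

# Toy example: unit square, spec "first output <= 2"
W = np.array([[1.0, 0.5], [0.0, 1.0]])
b = np.array([0.2, -0.1])
square = [np.array(v) for v in [(0, 0), (1, 0), (1, 1), (0, 1)]]
print(affine_on_vertices_certifies(W, b, square, (np.array([1.0, 0.0]), 2.0)))
```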
Fast Visualization by Shear-Warp using Spline Models for Data Reconstruction
This work is concerned with the rendering of huge three-dimensional data sets. The target is the development of fast algorithms that also apply recent and accurate volume reconstruction models to obtain largely artifact-free data visualizations. In part I, a comprehensive overview of the state of the art in volume rendering is given. Part II is devoted to the recently developed trivariate (linear,) quadratic and cubic spline models defined on symmetric tetrahedral partitions directly obtained by slicing volumetric partitions of a three-dimensional domain. These spline models define piecewise polynomials of total degree (one,) two and three with respect to a tetrahedron, i.e. the local splines have the lowest possible total degree and are adequate for efficient and accurate volume visualization. Part III describes, step by step, a fast software-based rendering algorithm called shear-warp. This algorithm is prominent for its ability to generate projections of volume data in real time. It attains its high rendering speed by using elaborate data structures and extensive pre-computation, but at the expense of data redundancy and the visual quality of the final rendering results. To circumvent these disadvantages, a further development is specified in which new techniques and sophisticated data structures allow combining the fast shear-warp with the accurate ray-casting approach. This strategy and the new data structures not only unify the benefits of both methods, they also admit adjustments to trade off between rendering speed and precision. This further development also removes the 3-fold data redundancy of the original shear-warp approach, allowing even larger three-dimensional data sets to be rendered more quickly. Additionally, real trivariate data reconstruction models, as discussed in part II, are applied together with the new ideas to improve the precision of the new volume rendering method, leading to an algorithm one order of magnitude faster than traditional approaches using similar reconstruction models. In part IV, a hierarchy-based rendering method is developed which utilizes a wavelet decomposition of the volume data, an octree structure to represent the sparse data set, the splines from part II, and a new shear-warp visualization algorithm similar to that presented in part III. The thesis concludes with the results presented in part V.
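A toy Python sketch of the shear-warp idea discussed in part III is given below: each slice is offset in proportion to its depth so that compositing becomes axis-aligned accumulation into an intermediate image, with the final 2-D warp to screen space omitted. The nearest-neighbour slice offsets and the opacity transfer function are simplifying assumptions rather than the thesis's spline-based reconstruction.

```python
import numpy as np

def shear_warp_render(volume, sx, sy):
    """Minimal shear-warp sketch for a z-principal viewing direction.

    Each z-slice is shifted by roughly (sx*z, sy*z) so that viewing rays
    become perpendicular to the slices; front-to-back compositing then
    reduces to cheap axis-aligned accumulation into an intermediate
    image, which a final 2-D warp (omitted here) maps to screen space.
    """
    nz, ny, nx = volume.shape
    pad = int(np.ceil(max(abs(sx), abs(sy)) * nz)) + 1
    inter = np.zeros((ny + 2 * pad, nx + 2 * pad))
    alpha = np.zeros_like(inter)                   # accumulated opacity
    for z in range(nz):                            # front-to-back order
        dy = int(round(sy * z)) + pad              # nearest-neighbour shear
        dx = int(round(sx * z)) + pad
        sl = volume[z]
        a = np.clip(sl, 0.0, 1.0) * 0.1            # toy opacity transfer
        region = (slice(dy, dy + ny), slice(dx, dx + nx))
        inter[region] += (1.0 - alpha[region]) * a * sl
        alpha[region] += (1.0 - alpha[region]) * a
    return inter

img = shear_warp_render(np.random.rand(32, 64, 64), sx=0.3, sy=0.1)
```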
An Efficient Hardware Implementation of LDPC Decoder
Reliable communication over a noisy channel is an old but still challenging problem for communication engineers. Low-density parity-check (LDPC) codes are linear block codes proposed by Robert G. Gallager in 1960. LDPC codes have lower complexity than Turbo codes. In most recent wireless communication standards, LDPC codes are among the most popular forward error correction (FEC) codes due to their excellent error-correcting capability. In this thesis we focus on a hardware implementation of the LDPC code used in the Digital Video Broadcasting - Satellite - Second Generation (DVB-S2) standard ratified in 2005. In the decoder architecture, the structure of DVB-S2 permits a memory mapping scheme that allows 360 functional units to operate simultaneously. The functional units are optimized to reduce hardware resource utilization on an FPGA. A novel design of a range-addressable lookup table (RALUT) for the hyperbolic tangent function is proposed that simplifies the LDPC decoding algorithm while performance remains the same. Commonly, RALUTs are uniformly distributed over the input; in our proposed method, instead of representing the LUT input uniformly, we use a non-uniform scale that assigns more entries to inputs near zero. A Zynq XC7Z030, an FPGA from the Xilinx Zynq-7000 family, is used to evaluate the complexity of the proposed design. Synthesis results show a speed increase due to the LUT method; however, LUTs demand more memory, so we reduce resource usage by applying the RALUT method.
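The non-uniform RALUT idea can be sketched in software as follows: breakpoints are packed geometrically towards zero, where tanh changes fastest, so a given entry count buys more accuracy than a uniform grid would. The geometric spacing, entry count, and input range below are illustrative choices, not the thesis's actual quantisation.

```python
import numpy as np

class RangeAddressableLUT:
    """Non-uniform range-addressable LUT, as a software sketch.

    Instead of uniform input steps, breakpoints are packed densely near
    zero, where tanh (and the check-node kernel built on it) changes
    fastest; every input falling in a range maps to one stored output.
    Assumes an odd function, so only |x| is tabulated.
    """
    def __init__(self, f, x_max, n_entries):
        # geometric spacing: many small steps near 0, few large ones
        self.edges = np.geomspace(1e-4, x_max, n_entries)
        mids = np.concatenate(([self.edges[0] / 2],
                               (self.edges[:-1] + self.edges[1:]) / 2))
        self.values = f(mids)          # one output per input range

    def __call__(self, x):
        i = np.searchsorted(self.edges, abs(x))
        i = min(i, len(self.values) - 1)
        return np.sign(x) * self.values[i]

lut = RangeAddressableLUT(np.tanh, x_max=8.0, n_entries=64)
print(lut(0.05), lut(2.0))   # close to tanh(0.05) and tanh(2.0)
```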