PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference
Field-programmable gate arrays (FPGAs) are widely used to implement deep
learning inference. Standard deep neural network inference involves the
computation of interleaved linear maps and nonlinear activation functions.
Prior work for ultra-low latency implementations has hardcoded the combination
of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our
work is motivated by the idea that the LUTs in an FPGA can be used to implement
a much greater variety of functions than this. In this paper, we propose a
novel approach to training neural networks for FPGA deployment using
multivariate polynomials as the basic building block. Our method takes
advantage of the flexibility offered by the soft logic, hiding the polynomial
evaluation inside the LUTs with zero overhead. We show that by using polynomial
building blocks, we can achieve the same accuracy using considerably fewer
layers of soft logic than by using linear functions, leading to significant
latency and area improvements. We demonstrate the effectiveness of this
approach in three tasks: network intrusion detection, jet identification at the
CERN Large Hadron Collider, and handwritten digit recognition using the MNIST
dataset.
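To make the idea concrete, here is a minimal Python sketch of a polynomial LUT neuron: because each neuron sees only a few low-precision inputs, its multivariate polynomial can be exhaustively enumerated into a truth table, which is exactly what an FPGA LUT stores. The coefficients, monomials, and sign activation below are illustrative assumptions, not the paper's trained parameters.

```python
import itertools
import numpy as np

def poly_neuron_lut(coeffs, monomials, n_inputs, in_bits):
    """Enumerate a multivariate-polynomial neuron as a lookup table.

    coeffs[i] multiplies the monomial whose per-input exponents are
    monomials[i]; inputs are unsigned fixed-point values of `in_bits`
    bits each. Because the input space is tiny (2**(n_inputs*in_bits)
    entries), polynomial evaluation is absorbed into the LUT for free.
    """
    levels = 2 ** in_bits
    table = {}
    for xs in itertools.product(range(levels), repeat=n_inputs):
        acc = 0.0
        for c, exps in zip(coeffs, monomials):
            term = c
            for x, e in zip(xs, exps):
                term *= x ** e
            acc += term
        table[xs] = np.sign(acc)  # toy binary activation
    return table

# Degree-2 polynomial of two 2-bit inputs: f(x0, x1) = 1 + x0*x1 - 0.5*x1**2
lut = poly_neuron_lut(
    coeffs=[1.0, 1.0, -0.5],
    monomials=[(0, 0), (1, 1), (0, 2)],
    n_inputs=2, in_bits=2,
)
print(lut[(3, 1)])
```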
Hardware and algorithm architectures for real-time additive synthesis
Additive synthesis is a fundamental computer music synthesis paradigm tracing its origins to the work of Fourier and Helmholtz. A rudimentary implementation linearly combines harmonic sinusoids (or partials) to generate tones whose perceived timbral characteristics are a strong function of the partial amplitude spectrum. Having evolved over time, additive synthesis now describes a collection of algorithms, each characterised by the time-varying linear combination of basis components to generate temporal evolution of timbre. Basis components include exactly harmonic partials, inharmonic partials with time-varying frequency, or non-sinusoidal waveforms, each with distinct spectral characteristics. Additive synthesis of polyphonic musical instrument tones requires a large number of independently controlled partials, incurring a large computational overhead whose investigation and reduction are key motivators for this work. The thesis begins with a review of prevalent synthesis techniques, setting additive synthesis in context and introducing the spectrum modelling paradigm, which provides baseline spectral data to the additive synthesis process obtained from the analysis of natural sounds. We proceed to investigate recursive and phase-accumulating digital sinusoidal oscillator algorithms, defining specific metrics to quantify relative performance. The concepts of phase accumulation, table-lookup phase-amplitude mapping and interpolated fractional addressing are introduced, developed, and shown to underpin an additive synthesis subclass: wavetable lookup synthesis (WLS). WLS performance is simulated against specific metrics and parameter conditions peculiar to computer music requirements. We conclude by presenting processing architectures which accelerate the computational throughput of specific WLS operations and the sinusoidal additive synthesis model. In particular, we introduce and investigate the concept of phase-domain processing and present several "pipeline friendly" arithmetic architectures using this technique which implement the additive synthesis of sinusoidal partials.
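As an illustration of the WLS core described above, the following Python sketch implements a phase-accumulating oscillator with table lookup and interpolated fractional addressing. The normalised floating-point phase stands in for the fixed-point accumulator a hardware implementation would use, and the table size and tone parameters are arbitrary.

```python
import numpy as np

def wavetable_oscillator(freq_hz, sample_rate, table, n_samples):
    """Phase-accumulating wavetable lookup with linear interpolation.

    A phase accumulator advances by an increment proportional to the
    desired frequency; the integer part of the scaled phase addresses
    the wavetable and the fractional part drives linear interpolation
    between neighbouring entries.
    """
    table_len = len(table)
    phase = 0.0                    # normalised phase in [0, 1)
    inc = freq_hz / sample_rate    # phase increment per sample
    out = np.empty(n_samples)
    for n in range(n_samples):
        pos = phase * table_len
        i = int(pos)
        frac = pos - i
        out[n] = (1.0 - frac) * table[i] + frac * table[(i + 1) % table_len]
        phase = (phase + inc) % 1.0
    return out

# One-cycle sine wavetable; a 440 Hz tone at 48 kHz
table = np.sin(2 * np.pi * np.arange(1024) / 1024)
tone = wavetable_oscillator(440.0, 48000.0, table, 480)
```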
Multipartite table methods
A unified view of most previous table-lookup-and-addition methods (bipartite tables, SBTM, STAM, and multipartite methods) is presented. This unified view allows a more accurate computation of the error entailed by these methods, which enables a wider design-space exploration, leading to tables smaller than the best previously published ones by up to 50 percent. The synthesis of these multipartite architectures on Virtex FPGAs is also discussed. Compared to other methods involving multipliers, the multipartite approach offers the best speed/area tradeoff for precisions up to 16 bits. A reference implementation is available at www.ens-lyon.fr/LIP/Arenaire/
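The bipartite special case of these table-lookup-and-addition methods can be sketched in a few lines of Python: the input splits into three bit fields, a table of initial values is indexed by the two leading fields, and a table of first-order offsets by the outer two, so a single addition replaces one exponentially larger table. The segment-midpoint sampling and forward-difference slope below are simplifying assumptions; the paper's multipartite decompositions and error analysis are considerably more refined.

```python
import numpy as np

def bipartite_tables(f, n0, n1, n2):
    """Build bipartite tables for f on [0, 1): f(x) ~= TIV[a,b] + TO[a,c].

    The (n0+n1+n2)-bit input splits as fields a.b.c. TIV stores f at the
    midpoint of each (a, b) segment; TO stores the first-order correction
    from the trailing field, using a slope that varies slowly in b.
    """
    s0, s1, s2 = 2**-n0, 2**-(n0 + n1), 2**-(n0 + n1 + n2)
    tiv = np.zeros((2**n0, 2**n1))
    to = np.zeros((2**n0, 2**n2))
    for a in range(2**n0):
        xa = a * s0
        for b in range(2**n1):
            tiv[a, b] = f(xa + b * s1 + s1 / 2)        # segment midpoint
        slope = (f(xa + s0 * 0.999) - f(xa)) / (s0 * 0.999)
        for c in range(2**n2):
            to[a, c] = (c * s2 - (2**n2 - 1) * s2 / 2) * slope
    return tiv, to

def bipartite_eval(tiv, to, a, b, c):
    """One add instead of one huge table lookup."""
    return tiv[a, b] + to[a, c]

tiv, to = bipartite_tables(np.exp, 4, 3, 3)   # 10-bit input, two small tables
print(bipartite_eval(tiv, to, 8, 4, 2))
```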
Finite element analysis of small-scale hot compression testing
This paper models hot compression testing using a dilatometer in loading mode. These small-scale tests provide a high throughput at low cost, but are susceptible to inhomogeneity due to friction and temperature gradients. A novel method is presented for correcting the true stress-strain constitutive response over the full range of temperatures, strain-rates and strain. The nominal response from the tests is used to predict the offset in the stress-strain curves due to inhomogeneity, and this stress offset Δσ is applied piecewise to the data, correcting the constitutive response in one iteration. A key new feature is the smoothing and fitting of the flow stress data as a function of temperature and strain-rate, at multiple discrete strains. The corrected model then provides quantitative prediction of the spatial and temporal variation in strain-rate and strain throughout the sample, needed to correlate the local deformation conditions with the microstructure and texture evolution. The study uses a detailed series of 144 hot compression tests of a Zr-Nb alloy. While this is an important wrought nuclear alloy in its own right, it also serves here as a test case for modelling the dilatometer for hot testing of high temperature alloys, particularly those with dual α-β phase microstructures (such as titanium alloys).
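A rough Python sketch of the correction step described above might look as follows. The dictionary mapping discrete strains to FE-predicted offsets, the polynomial form of the smoothing surface, and all numbers are hypothetical stand-ins for the paper's actual FE output and fitting procedure.

```python
import numpy as np

def correct_flow_stress(strain, stress_nominal, delta_sigma):
    """Apply the FE-predicted inhomogeneity offset piecewise in strain.

    `delta_sigma` maps discrete strain levels to the stress offset
    attributed to friction and thermal gradients; interpolating it onto
    the measured strain axis and subtracting corrects the curve in a
    single iteration.
    """
    eps_pts = np.array(sorted(delta_sigma))
    dsig_pts = np.array([delta_sigma[e] for e in eps_pts])
    return stress_nominal - np.interp(strain, eps_pts, dsig_pts)

def smooth_flow_stress(T, log_rate, sigma, deg=2):
    """Least-squares polynomial surface sigma(T, log strain-rate) at one
    discrete strain: the smoothing-and-fitting step of the method."""
    A = np.column_stack([T**i * log_rate**j
                         for i in range(deg + 1)
                         for j in range(deg + 1 - i)])
    coeffs, *_ = np.linalg.lstsq(A, sigma, rcond=None)
    return coeffs

# Toy usage with made-up numbers
eps = np.linspace(0.0, 0.5, 6)
sig = 100 + 200 * eps
corrected = correct_flow_stress(eps, sig, {0.1: 5.0, 0.3: 9.0, 0.5: 12.0})
T = np.array([800.0, 850.0, 900.0, 800.0, 850.0, 900.0])
rate = np.log10([0.1, 0.1, 0.1, 1.0, 1.0, 1.0])
coeffs = smooth_flow_stress(T, rate, np.array([120.0, 100.0, 85.0, 140.0, 118.0, 99.0]))
```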
Architecture-Preserving Provable Repair of Deep Neural Networks
Deep neural networks (DNNs) are becoming increasingly important components of
software, and are considered the state-of-the-art solution for a number of
problems, such as image recognition. However, DNNs are far from infallible, and
incorrect behavior of DNNs can have disastrous real-world consequences. This
paper addresses the problem of architecture-preserving V-polytope provable
repair of DNNs. A V-polytope defines a convex bounded polytope using its vertex
representation. V-polytope provable repair guarantees that the repaired DNN
satisfies the given specification on the infinite set of points in the given
V-polytope. An architecture-preserving repair only modifies the parameters of
the DNN, without modifying its architecture. The repair has the flexibility to
modify multiple layers of the DNN, and runs in polynomial time. It supports
DNNs with activation functions that have some linear pieces, as well as
fully-connected, convolutional, pooling, and residual layers. To the best of our
knowledge, this is the first provable repair approach that has all of these
features. We implement our approach in a tool called APRNN. Using MNIST,
ImageNet, and ACAS Xu DNNs, we show that it has better efficiency, scalability,
and generalization compared to PRDNN and REASSURE, prior provable repair
methods that are not architecture preserving.
Comment: Accepted paper at PLDI 2023. The tool is available at https://github.com/95616ARG/APRNN
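The convexity argument that makes vertex-based checking sound can be sketched in a few lines of Python: if the repaired network restricted to the polytope is affine and the output specification is a halfspace, checking the finitely many vertices certifies the infinite point set. This shows only the underlying principle, not the APRNN repair algorithm itself; the weights, polytope, and specification below are toy values.

```python
import numpy as np

def affine_on_vertices_certifies(W, b, vertices, spec):
    """Vertex check behind V-polytope provable repair (illustrative).

    If the network restricted to the polytope is the affine map
    x -> W @ x + b and `spec` is a halfspace constraint c.y <= d on the
    output, then satisfying it at every vertex certifies it everywhere
    in the polytope: the affine image of a convex hull is the convex
    hull of the vertex images, and halfspaces are closed under
    convex combination.
    """
    c, d = spec
    return all(c @ (W @ v + b) <= d for v in vertices)

# Toy example: unit square, spec "first output <= 2"
W = np.array([[1.0, 0.5], [0.0, 1.0]])
b = np.array([0.2, -0.1])
square = [np.array(v) for v in [(0, 0), (1, 0), (1, 1), (0, 1)]]
print(affine_on_vertices_certifies(W, b, square, (np.array([1.0, 0.0]), 2.0)))
```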
Fast Visualization by Shear-Warp using Spline Models for Data Reconstruction
This work is concerned with the rendering of huge three-dimensional data sets. The target is the development of fast algorithms that also apply recent and accurate volume reconstruction models to obtain largely artifact-free data visualizations. In part I, a comprehensive overview of the state of the art in volume rendering is given. Part II is devoted to the recently developed trivariate (linear,) quadratic and cubic spline models defined on symmetric tetrahedral partitions directly obtained by slicing volumetric partitions of a three-dimensional domain. These spline models define piecewise polynomials of total degree (one,) two and three with respect to a tetrahedron, i.e. the local splines have the lowest possible total degree and are adequate for efficient and accurate volume visualization. Part III describes, step by step, a fast software-based rendering algorithm called shear-warp. This algorithm is prominent for its ability to generate projections of volume data in real time. It attains its high rendering speed by using elaborate data structures and extensive pre-computation, but at the expense of data redundancy and the visual quality of the final rendering results. To circumvent these disadvantages, a further development is specified in which new techniques and sophisticated data structures allow combining the fast shear-warp with the accurate ray-casting approach. This strategy and the new data structures not only unify the benefits of both methods, they also admit adjustments to trade off between rendering speed and precision. This further development also removes the 3-fold data redundancy of the original shear-warp approach, allowing even larger three-dimensional data sets to be rendered more quickly. Additionally, real trivariate data reconstruction models, as discussed in part II, are applied together with the new ideas to improve the precision of the new volume rendering method, leading to an algorithm one order of magnitude faster than traditional approaches using similar reconstruction models. In part IV, a hierarchy-based rendering method is developed which utilizes a wavelet decomposition of the volume data, an octree structure to represent the sparse data set, the splines from part II, and a new shear-warp visualization algorithm similar to that presented in part III. The thesis concludes with the results presented in part V.
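A toy Python sketch of the shear-warp idea discussed in part III is given below: each slice is offset in proportion to its depth so that compositing becomes axis-aligned accumulation into an intermediate image, with the final 2-D warp to screen space omitted. The nearest-neighbour slice offsets and the opacity transfer function are simplifying assumptions rather than the thesis's spline-based reconstruction.

```python
import numpy as np

def shear_warp_render(volume, sx, sy):
    """Minimal shear-warp sketch for a z-principal viewing direction.

    Each z-slice is shifted by roughly (sx*z, sy*z) so that viewing rays
    become perpendicular to the slices; front-to-back compositing then
    reduces to cheap axis-aligned accumulation into an intermediate
    image, which a final 2-D warp (omitted here) maps to screen space.
    """
    nz, ny, nx = volume.shape
    pad = int(np.ceil(max(abs(sx), abs(sy)) * nz)) + 1
    inter = np.zeros((ny + 2 * pad, nx + 2 * pad))
    alpha = np.zeros_like(inter)                   # accumulated opacity
    for z in range(nz):                            # front-to-back order
        dy = int(round(sy * z)) + pad              # nearest-neighbour shear
        dx = int(round(sx * z)) + pad
        sl = volume[z]
        a = np.clip(sl, 0.0, 1.0) * 0.1            # toy opacity transfer
        region = (slice(dy, dy + ny), slice(dx, dx + nx))
        inter[region] += (1.0 - alpha[region]) * a * sl
        alpha[region] += (1.0 - alpha[region]) * a
    return inter

img = shear_warp_render(np.random.rand(32, 64, 64), sx=0.3, sy=0.1)
```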
An Efficient Hardware Implementation of LDPC Decoder
Reliable communication over a noisy channel is an old but still challenging problem for communication engineers. Low-density parity-check (LDPC) codes are linear block codes proposed by Robert G. Gallager in 1960. LDPC codes have lower complexity than Turbo codes. In most recent wireless communication standards, LDPC codes are among the most popular forward error correction (FEC) codes due to their excellent error-correcting capability. In this thesis we focus on a hardware implementation of the LDPC code used in the Digital Video Broadcasting - Satellite - Second Generation (DVB-S2) standard ratified in 2005. In the decoder architecture, the structure of DVB-S2 permits a memory mapping scheme that allows 360 functional units to operate simultaneously. The functional units are optimized to reduce hardware resource utilization on an FPGA. A novel design of a range-addressable lookup table (RALUT) for the hyperbolic tangent function is proposed that simplifies the LDPC decoding algorithm while performance remains the same. Commonly, RALUTs are uniformly distributed over the input; in our proposed method, instead of representing the LUT input uniformly, we use a non-uniform scale that assigns more entries to inputs near zero. A Zynq XC7Z030, an FPGA from the Xilinx Zynq-7000 family, is used to evaluate the complexity of the proposed design. Synthesis results show a speed increase due to the LUT method; however, LUTs demand more memory, so we reduce resource usage by applying the RALUT method.
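The non-uniform RALUT idea can be sketched in software as follows: breakpoints are packed geometrically towards zero, where tanh changes fastest, so a given entry count buys more accuracy than a uniform grid would. The geometric spacing, entry count, and input range below are illustrative choices, not the thesis's actual quantisation.

```python
import numpy as np

class RangeAddressableLUT:
    """Non-uniform range-addressable LUT, as a software sketch.

    Instead of uniform input steps, breakpoints are packed densely near
    zero, where tanh (and the check-node kernel built on it) changes
    fastest; every input falling in a range maps to one stored output.
    Assumes an odd function, so only |x| is tabulated.
    """
    def __init__(self, f, x_max, n_entries):
        # geometric spacing: many small steps near 0, few large ones
        self.edges = np.geomspace(1e-4, x_max, n_entries)
        mids = np.concatenate(([self.edges[0] / 2],
                               (self.edges[:-1] + self.edges[1:]) / 2))
        self.values = f(mids)          # one output per input range

    def __call__(self, x):
        i = np.searchsorted(self.edges, abs(x))
        i = min(i, len(self.values) - 1)
        return np.sign(x) * self.values[i]

lut = RangeAddressableLUT(np.tanh, x_max=8.0, n_entries=64)
print(lut(0.05), lut(2.0))   # close to tanh(0.05) and tanh(2.0)
```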