
    Optimistic Parallelization of Floating-Point Accumulation

    Floating-point arithmetic is notoriously non-associative: the limited-precision representation demands that intermediate values be rounded to fit the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device, where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16-PE design, we demonstrate an average speedup of 6× with randomly generated data and of 3-7× with summations extracted from Conjugate Gradient benchmarks.
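    As a software-level illustration of the relax-until-converged idea (a minimal sketch of our own, not the authors' statically scheduled hardware; `two_sum` and `relaxed_sum` are illustrative names), the Python below first sums optimistically, captures every rounding error with Knuth's error-free TwoSum transformation, and keeps re-injecting the errors until none remain:

```python
def two_sum(a, b):
    # Knuth's TwoSum: s = fl(a + b) and e is the rounding error,
    # so a + b == s + e holds exactly in IEEE-754 arithmetic.
    s = a + b
    bp = s - a
    e = (a - (s - bp)) + (b - bp)
    return s, e

def relaxed_sum(values):
    # Optimistic pass: sum in one sweep (any association order would do),
    # collecting the rounding error of every addition. Relaxation: re-sum
    # the surviving errors, iterating until no error remains.
    vals = list(values)
    while True:
        errors = []
        s = vals[0]
        for v in vals[1:]:
            s, e = two_sum(s, v)
            if e != 0.0:
                errors.append(e)
        if not errors:          # fixpoint reached: errors fully absorbed
            return s
        vals = errors + [s]     # propagate errors into the next iteration
```

    For instance, `relaxed_sum([1e16, 1.0, -1e16])` returns 1.0, whereas a naive left-to-right pass returns 0.0 because the 1.0 is rounded away in the first addition.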

    Automated design of low complexity FIR filters


    Efficiency analysis methodology of FPGAs based on lost frequencies, area and cycles

    We propose a methodology to study and quantify efficiency and the impact of overheads on runtime performance. Most work on High-Performance Computing (HPC) for FPGAs studies only runtime performance or cost, while we are interested in how far we are from peak performance and, more importantly, why. The efficiency of runtime performance is defined with respect to the ideal computational runtime in the absence of inefficiencies. The analysis of the difference between actual and ideal runtime reveals the overheads and bottlenecks. A formal approach is proposed to decompose the efficiency into three components: frequency, area and cycles. After quantification of the efficiencies, a detailed analysis has to reveal the reasons for the lost frequencies, lost area and lost cycles. We propose a taxonomy of possible causes and practical methods to identify and quantify the overheads. The methodology is applied to a number of use cases as illustration. We show the interaction between the three components of efficiency and how bottlenecks are revealed.
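    The abstract does not spell out the definitions, but one natural (assumed) reading of the three-component decomposition is multiplicative, so each shortfall can be attributed separately; the sketch below uses hypothetical names and is not taken from the paper:

```python
def fpga_efficiency(f_achieved, f_peak, area_used, area_total,
                    cycles_ideal, cycles_actual):
    # Assumed multiplicative decomposition: overall efficiency is the
    # product of a frequency, an area and a cycle component, so lost
    # frequency, lost area and lost cycles can each be quantified.
    e_freq = f_achieved / f_peak             # lost frequency
    e_area = area_used / area_total          # lost area
    e_cycles = cycles_ideal / cycles_actual  # lost cycles (stalls, bubbles)
    return e_freq * e_area * e_cycles, (e_freq, e_area, e_cycles)
```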

    Weighted p-bits for FPGA implementation of probabilistic circuits

    Probabilistic spin logic (PSL) is a recently proposed computing paradigm based on unstable stochastic units called probabilistic bits (p-bits) that can be correlated to form probabilistic circuits (p-circuits). These p-circuits can be used to solve problems of optimization and inference, and also to implement precise Boolean functions in an "inverted" mode, where a given Boolean circuit operates in reverse to find the input combinations that are consistent with a given output. In this paper we present a scalable FPGA implementation of such invertible p-circuits. We implement a "weighted" p-bit that combines stochastic units with localized memory structures. We also present a generalized tile of weighted p-bits to which a large class of problems beyond invertible Boolean logic can be mapped, and we show how invertibility can be applied to interesting problems such as the NP-complete Subset Sum Problem by solving a small instance of it in hardware.
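    For reference, the p-bit update rule that is standard in the PSL literature is m_i = sgn(tanh(I_i) - r), with input I_i = ÎŁ_j W_ij m_j + h_i and r drawn uniformly from (-1, 1). The sketch below is a plain software model of that rule; the weights, the toy two-bit network and the sweep order are illustrative, not the paper's FPGA tile:

```python
import numpy as np

def pbit_sweep(m, W, h, rng):
    # One asynchronous sweep: each p-bit i flips to +1 with probability
    # (1 + tanh(I_i)) / 2, where I_i = W[i] @ m + h[i] is its weighted input.
    for i in rng.permutation(len(m)):
        I = W[i] @ m + h[i]
        m[i] = 1 if rng.uniform(-1.0, 1.0) < np.tanh(I) else -1
    return m

# Toy usage: two p-bits coupled ferromagnetically tend to align.
rng = np.random.default_rng(0)
W = np.array([[0.0, 2.0], [2.0, 0.0]])
h = np.zeros(2)
m = np.array([1, -1])
for _ in range(100):
    m = pbit_sweep(m, W, h, rng)
print(m)   # the two bits agree most of the time
```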

    Reproducible Summation under HUB Format

    A different version of this paper was presented at the conference. Floating-point reproducibility is a property demanded by programmers and end users. Half-Unit-Biased (HUB) is a new representation format in which round to nearest is carried out by truncation, preventing any carry propagation and saving time and area. In this paper we study the reproducible summation of HUB numbers by using an error-free vector transformation technique, providing both a specific architecture and the usage of combined HUB/standard floating-point adders to achieve a reproducible result. Universidad de MĂĄlaga. Campus de Excelencia Internacional AndalucĂ­a Tech.
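    A toy fixed-point sketch of the HUB rounding property (our own illustration with assumed parameters, not the paper's architecture): a HUB number implicitly appends a '1' as its least-significant bit, so every representable value sits half a ulp above its stored bits, and round to nearest reduces to truncation:

```python
def hub_truncate(exact, drop_bits):
    # 'exact' is an unsigned fixed-point significand; keep the high bits,
    # drop 'drop_bits' low bits. No rounding adder, no carry propagation.
    stored = exact >> drop_bits
    # HUB convention: the stored bits implicitly carry a trailing '1', so
    # the represented value is stored * 2**drop_bits + 2**(drop_bits - 1).
    represented = (stored << drop_bits) | (1 << (drop_bits - 1))
    # The error never exceeds half an output ulp: round to nearest for free.
    assert abs(exact - represented) <= 1 << (drop_bits - 1)
    return stored

print(hub_truncate(0b10111, 2))  # 0b101, within half a ulp of the HUB value
```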

    A general framework for efficient FPGA implementation of matrix product

    Original article can be found at: http://www.medjcn.com/ Copyright Softmotor Limited. High-performance systems are required by developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Field-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high-performance systems at an economical price. Given the importance and the use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for the generation of matrix algorithm cores is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, spanning three matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general-purpose or an application-specific core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles, while the second one is a parallel floating-point matrix multiplier designed for 3D affine transformations. Peer reviewed.
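    The first of the two cores is a constant-matrix multiply. As a point of reference (an illustrative example of ours using the widely published BT.601 RGB-to-YCbCr coefficients, not figures taken from the paper), the per-pixel computation that such a distributed-arithmetic core pipelines is:

```python
import numpy as np

# BT.601 full-range RGB -> YCbCr: a fixed 3x3 matrix plus offsets. Because
# the coefficients are constants, distributed arithmetic can replace the
# multipliers with small lookup tables and additions.
M = np.array([[ 0.299,     0.587,     0.114   ],
              [-0.168736, -0.331264,  0.5     ],
              [ 0.5,      -0.418688, -0.081312]])
OFFSET = np.array([0.0, 128.0, 128.0])

def rgb_to_ycbcr(rgb):
    return M @ rgb + OFFSET

print(rgb_to_ycbcr(np.array([255.0, 0.0, 0.0])))  # pure red
```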

    Streaming Reduction Circuit

    Reduction circuits are used to reduce rows of floating-point values to single values. Binary floating-point operators often have deep pipelines, which may cause hazards when many consecutive rows have to be reduced. We present an algorithm by which any number of consecutive rows of arbitrary lengths can be reduced by a pipelined commutative and associative binary operator in an efficient manner. The algorithm is simple to implement, has a low latency, produces results in order, and requires only small buffers. Moreover, it uses only a single pipeline for the involved operation. The complexity of the algorithm depends on the depth of the pipeline, not on the length of the input rows. In this paper we discuss an implementation of this algorithm and prove its correctness.
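    To make the pipeline hazard concrete, here is a toy cycle-level model (a simplified greedy scheduler of our own, not the paper's algorithm) of a single D-deep pipelined adder reducing a stream of rows; whenever no two operands from the same row are available, a bubble enters the pipeline:

```python
from collections import deque, defaultdict

def streaming_reduce(rows, depth=10):
    stream = deque((r, v) for r, row in enumerate(rows) for v in row)
    pending = defaultdict(list)        # per-row operands awaiting pairing
    live = {r: len(row) for r, row in enumerate(rows)}
    pipe = deque([None] * depth)       # contents of the adder pipeline
    results = {}
    while len(results) < len(rows):
        out = pipe.pop()               # sum leaving the adder this cycle
        if out is not None:
            pending[out[0]].append(out[1])
        if stream:                     # one new input value per cycle
            r, v = stream.popleft()
            pending[r].append(v)
        issued = None
        for r, vals in pending.items():
            if len(vals) >= 2:         # pair two same-row operands
                a, b = vals.pop(), vals.pop()
                live[r] -= 1           # each addition consumes one value
                issued = (r, a + b)
                break
            if vals and live[r] == 1 and r not in results:
                results[r] = vals.pop()   # row fully reduced
        pipe.appendleft(issued)        # bubble if nothing could issue
    return [results[r] for r in range(len(rows))]

print(streaming_reduce([[1.0, 2.0, 3.0], [4.0], [0.5, 0.5]], depth=4))
# -> [6.0, 4.0, 1.0]
```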

    Constrained LQR for Low-Precision Data Representation

    Performing computations with a low-bit number representation results in a faster implementation that uses less silicon, and hence allows an algorithm to be implemented in smaller and cheaper processors without loss of performance. We propose a novel formulation to efficiently exploit the low (or non-standard) precision number representation of some computer architectures when computing the solution to constrained LQR problems, such as those that arise in predictive control. The main idea is to include suitably defined decision variables in the quadratic program, in addition to the states and the inputs, to allow for smaller roundoff errors in the solver. This enables one to trade off the number of bits used for data representation against speed and/or hardware resources, so that smaller numerical errors can be achieved for the same number of bits (same silicon area). Because of data dependencies, the algorithm complexity, in terms of computation time and hardware resources, does not necessarily increase despite the larger number of decision variables. Examples show that a 10-fold reduction in hardware resources is possible compared to using double-precision floating point, without loss of closed-loop performance.
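    The flavour of the idea, keeping extra structure in the QP so that the solver works with better-scaled numbers, can be seen in the standard "lifted" formulation sketched below, which keeps the states as decision variables instead of condensing them out (condensing introduces powers of A and tends to worsen numerical scaling). The paper's additional roundoff-reducing decision variables are its contribution and are not reproduced here:

```python
import numpy as np
from scipy.linalg import block_diag

def lifted_lqr_qp(A, B, Q, R, N, x0):
    """Build min 0.5 z'Hz  s.t.  E z = e  over z = [u0, x1, u1, ..., xN],
    keeping the dynamics x_{k+1} = A x_k + B u_k as equality constraints."""
    n, m = B.shape
    H = block_diag(*([R, Q] * N))            # stage costs for [u_k, x_{k+1}]
    E = np.zeros((N * n, N * (n + m)))
    e = np.zeros(N * n)
    for k in range(N):
        row = slice(k * n, (k + 1) * n)
        E[row, k * (n + m):k * (n + m) + m] = -B               # -B u_k
        E[row, k * (n + m) + m:(k + 1) * (n + m)] = np.eye(n)  # +x_{k+1}
        if k == 0:
            e[row] = A @ x0                  # x_1 - B u_0 = A x_0
        else:
            E[row, (k - 1) * (n + m) + m:k * (n + m)] = -A     # -A x_k
    return H, E, e

# Toy double-integrator instance over a horizon of 5 steps.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
H, E, e = lifted_lqr_qp(A, B, np.eye(2), np.eye(1), 5, np.array([1.0, 0.0]))
```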
    • 

    corecore