2,529 research outputs found
Optimistic Parallelization of Floating-Point Accumulation
Floating-point arithmetic is notoriously non-associative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6Ă with randomly generated data and 3-7Ă with summations extracted from Conjugate Gradient benchmarks
Efficiency analysis methodology of FPGAs based on lost frequencies, area and cycles
We propose a methodology to study and to quantify efficiency and the impact of overheads on runtime performance. Most work on High-Performance Computing (HPC) for FPGAs only studies runtime performance or cost, while we are interested in how far we are from peak performance and, more importantly, why. The efficiency of runtime performance is defined with respect to the ideal computational runtime in absence of inefficiencies. The analysis of the difference between actual and ideal runtime reveals the overheads and bottlenecks. A formal approach is proposed to decompose the efficiency into three components: frequency, area and cycles. After quantification of the efficiencies, a detailed analysis has to reveal the reasons for the lost frequencies, lost area and lost cycles. We propose a taxonomy of possible causes and practical methods to identify and quantify the overheads. The proposed methodology is applied on a number of use cases to illustrate the methodology. We show the interaction between the three components of efficiency and show how bottlenecks are revealed
Weighted p-bits for FPGA implementation of probabilistic circuits
Probabilistic spin logic (PSL) is a recently proposed computing paradigm
based on unstable stochastic units called probabilistic bits (p-bits) that can
be correlated to form probabilistic circuits (p-circuits). These p-circuits can
be used to solve problems of optimization, inference and also to implement
precise Boolean functions in an "inverted" mode, where a given Boolean circuit
can operate in reverse to find the input combinations that are consistent with
a given output. In this paper we present a scalable FPGA implementation of such
invertible p-circuits. We implement a "weighted" p-bit that combines stochastic
units with localized memory structures. We also present a generalized tile of
weighted p-bits to which a large class of problems beyond invertible Boolean
logic can be mapped, and how invertibility can be applied to interesting
problems such as the NP-complete Subset Sum Problem by solving a small instance
of this problem in hardware
Reproducible SUmmation under HUB Format
Version diferente del paper presentado en el congresoFloating point reproducibility is a property
claimed by programmers and end users. Half-Unit-Biased
(HUB) is a new representation format in which the round
to nearest is carried out by truncation, preventing any carry
propagation and saving time and area. In this paper we study
the reproducible summation of HUB numbers by using a errorfree
vector transformation technique, providing both a specific
architecture and the usage of combined HUB/Standard floating
point adders to achieve a reproducible resultUniversidad de MĂĄlaga. Campus de Excelencia Internacional AndalucĂa Tech
A general framework for efficient FPGA implementation of matrix product
Original article can be found at: http://www.medjcn.com/ Copyright Softmotor LimitedHigh performance systems are required by the developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Filed-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high performance systems at an economical price. Given the importance and the use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for matrix algorithm cores generation is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, ranging in three different matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general purpose or a specific application core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles while the second one is a parallel floating-point matrix multiplier designed for 3D affine transformations.Peer reviewe
Streaming Reduction Circuit
Reduction circuits are used to reduce rows of ïŹoating point values to single values. Binary ïŹoating point operators often have deep pipelines, which may cause hazards when many consecutive rows have to be reduced. We present an algorithm by which any number of consecutive rows of arbitrary lengths can be reduced by a pipelined commutative and associative binary operator in an efficient manner. The algorithm is simple to implement, has a low latency, produces results in-order, and requires only small buffers. Besides, it uses only a single pipeline for the involved operation. The complexity of the algorithm depends on the depth of the pipeline, not on the length of the input rows. In this paper we discuss an implementation of this algorithm and we prove its correctness
Constrained LQR for Low-Precision Data Representation
Performing computations with a low-bit number representation results in a faster implementation that uses less silicon, and hence allows an algorithm to be implemented in smaller and cheaper processors without loss of performance. We propose a novel formulation to efficiently exploit the low (or non-standard) precision number representation of some computer architectures when computing the solution to constrained LQR problems, such as those that arise in predictive control. The main idea is to include suitably-defined decision variables in the quadratic program, in addition to the states and the inputs, to allow for smaller roundoff errors in the solver. This enables one to trade off the number of bits used for data representation against speed and/or hardware resources, so that smaller numerical errors can be achieved for the same number of bits (same silicon area). Because of data dependencies, the algorithm complexity, in terms of computation time and hardware resources, does not necessarily increase despite the larger number of decision variables. Examples show that a 10-fold reduction in hardware resources is possible compared to using double precision floating point, without loss of closed-loop performance
- âŠ