Search CORE

1,048 research outputs found

LOCOFloat: A low-cost floating-point format for FPGAs.: Application to HIL simulators

Author: Castro Ángel de
Garrido Javier
Martínez-García María Sofía
Sánchez Alberto
Publication venue: 'MDPI AG'
Publication date: 01/01/2020
Field of study

One of the main decisions when making a digital design is which arithmetic is going to be used. The arithmetic determines the hardware resources needed and the latency of every operation. This is especially important in real-time applications like HIL (Hardware-in-the-loop), where a real-time simulation of a plant—power converter, mechanical system, or any other complex system—is accomplished. While a fixed-point gets optimal implementations, using considerably fewer resources and allowing smaller simulation steps, its use is very restricted to very specific applications, as its design effort is quite high. On the other side, IEEE-754 floating-point may have resolution problems in case of the 32-bit version, and excessive hardware usage in case of the 64-bit version. This paper presents LOCOFloat, a low-cost floating-point format designed for FPGA applications. Its key features are soft normalization of the results, using significand and exponent fields in two’s complement. This paper shows the implementation of addition, subtraction and multiplication of the proposed format. Both IEEE-754 versions and LOCOFloat are compared in this paper, implementing a HIL model of a buck converter. Although the application example is a HIL simulator, other applications could take benefit from the proposed format. Results show that LOCOFloat is as accurate as 64-bit floating-point, while reducing the use of DSPs blocks by 84%

Biblos-e Archivo

2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation

Author: Warren Michael S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 16/10/2013
Field of study

We report on improvements made over the past two decades to our adaptive treecode N-body method (HOT). A mathematical and computational approach to the cosmological N-body problem is described, with performance and scalability measured up to 256k (

2^{18}

) processors. We present error analysis and scientific application results from a series of more than ten 69 billion (

4096^3

) particle cosmological simulations, accounting for

4 \times 10^{20}

floating point operations. These results include the first simulations using the new constraints on the standard model of cosmology from the Planck satellite. Our simulations set a new standard for accuracy and scientific throughput, while meeting or exceeding the computational efficiency of the latest generation of hybrid TreePM N-body methods.Comment: 12 pages, 8 figures, 77 references; To appear in Proceedings of SC '1

arXiv.org e-Print Archive

CiteSeerX

Directory of Open Access Journals

THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT

Author: Kannan Balaji
Publication venue: Clemson University Libraries
Publication date: 01/12/2009
Field of study

A 16 bit floating point (FP) Arithmetic Logic Unit (ALU) was designed and implemented in 0.35µm CMOS technology. Typical uses of the 16 bit FP ALU include graphics processors and embedded multimedia applications. The ALU of the modern microprocessors use a fused multiply add (FMA) design technique. An advantage of the FMA is to remove the need for a comparator which is required for a normal FP adder. The FMA consists of a multiplier, shifters, adders and rounding circuit. A fast multiplier based on the Wallace tree configuration was designed. The number of partial products was greatly reduced by the use of the modified booth encoder. The Wallace tree was chosen to reduce the number of reduction layers of partial products. The multiplier also involved the design of a pass transistor based 4:2 compressor. The average delay of the pass transistor based compressor was 55ps and was found to be 7 times faster than the full adder based 4:2 compressor. The shifters consist of separate left and right shifters using multiplexers. The shift amount is calculated using the exponents of the three operands. The addition operation is implemented using a carry skip adder (CSK). The average delay of the CSK was 1.05ns and was slower than the carry look ahead adder by about 400ps. The advantages of the CSK are reduced power, gate count and area when compared to the similar sized carry look ahead adder. The adder computes the addition of the multiplier result and the shifted value of the addend. In most modern computers, division is performed using software thereby eliminating the need for a separate hardware unit. FMA hardware unit was utilized to perform FP division. The FP divider uses the Newton Raphson algorithm to solve division by iteration. The initial approximated value with five bit accuracy was assumed to be pre-stored in cache memory and a separate clock cycle for cache read was assumed before the start of the FP division operation. In order to significantly reduce the area of the design, only one multiplier was used. Rounding to nearest technique was implemented using an 11 bit variable CSK adder. This is the best rounding technique when compared to other rounding techniques. In both the FMA and division, rounding was performed after the computation of the final result during the last clock cycle of operation. Testability analysis is performed for the multiplier which is the most complex and critical part of the FP ALU. The specific aim of testability was to ensure the correct operation of the multiplier and thus guarantee the correctness of the FMA circuit at the layout stage. The multiplier\u27s output was tested by identifying the minimal number of input vectors which toggle the inputs of the 4:2 compressors of the multiplier. The test vectors were identified in a semi automated manner using Perl scripting language. The multiplier was tested with a test set of thirty one vectors. The fault coverage of the multiplier was found to be 90.09%. The layout was implemented using IC station of Mentor Graphics CAD tool and resulted in a chip area of 1.96mm2. The specifications for basic arithmetic operations were met successfully. FP Division operation was completed within six clock cycles. The other arithmetic operations like FMA, FP addition, FP subtraction and FP multiplication were completed within three clock cycles

Clemson University: TigerPrints

Implementation and Synthesis of Math Library Functions

Author: Briggs Ian
Lad Yash
Panchekha Pavel
Publication venue
Publication date: 02/11/2023
Field of study

Achieving speed and accuracy for math library functions like exp, sin, and log is difficult. This is because low-level implementation languages like C do not help math library developers catch mathematical errors, build implementations incrementally, or separate high-level and low-level decision making. This ultimately puts development of such functions out of reach for all but the most experienced experts. To address this, we introduce MegaLibm, a domain-specific language for implementing, testing, and tuning math library implementations. MegaLibm is safe, modular, and tunable. Implementations in MegaLibm can automatically detect mathematical mistakes like sign flips via semantic wellformedness checks, and components like range reductions can be implemented in a modular, composable way, simplifying implementations. Once the high-level algorithm is done, tuning parameters like working precisions and evaluation schemes can be adjusted through orthogonal tuning parameters to achieve the desired speed and accuracy. MegaLibm also enables math library developers to work interactively, compiling, testing, and tuning their implementations and invoking tools like Sollya and type-directed synthesis to complete components and synthesize entire implementations. MegaLibm can express 8 state-of-the-art math library implementations with comparable speed and accuracy to the original C code, and can synthesize 5 variations and 3 from-scratch implementations with minimal guidance.Comment: 25 pages, 12 figure

arXiv.org e-Print Archive

An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

Author: Brocard Sylvan
Cimadomo Remy
Guo Yuxin
Gómez-Luna Juan
Legriel Julien
Mutlu Onur
Oliveira Geraldo F.
Singh Gagandeep
Publication venue
Publication date: 20/04/2023
Field of study

Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is

27\times

faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and

1.34\times

faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is

2.8\times

and

3.2\times

than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems

arXiv.org e-Print Archive

The use of primitives in the calculation of radiative view factors

Author: Walker Trevor John
Publication venue: Faculty of Engineering and Information Technologies, School of Chemical and Biomolecular Engineering
Publication date: 01/01/2014
Field of study

Compilations of radiative view factors (often in closed analytical form) are readily available in the open literature for commonly encountered geometries. For more complex three-dimensional (3D) scenarios, however, the effort required to solve the requisite multi-dimensional integrations needed to estimate a required view factor can be daunting to say the least. In such cases, a combination of finite element methods (where the geometry in question is sub-divided into a large number of uniform, often triangular, elements) and Monte Carlo Ray Tracing (MC-RT) has been developed, although frequently the software implementation is suitable only for a limited set of geometrical scenarios. Driven initially by a need to calculate the radiative heat transfer occurring within an operational fibre-drawing furnace, this research set out to examine options whereby MC-RT could be used to cost-effectively calculate any generic 3D radiative view factor using current vectorisation technologies

Sydney eScholarship

ColDICE: a parallel Vlasov-Poisson solver using moving adaptive simplicial tessellation

Author: Colombi Stéphane
Sousbie Thierry
Publication venue: 'Elsevier BV'
Publication date: 23/09/2015
Field of study

Resolving numerically Vlasov-Poisson equations for initially cold systems can be reduced to following the evolution of a three-dimensional sheet evolving in six-dimensional phase-space. We describe a public parallel numerical algorithm consisting in representing the phase-space sheet with a conforming, self-adaptive simplicial tessellation of which the vertices follow the Lagrangian equations of motion. The algorithm is implemented both in six- and four-dimensional phase-space. Refinement of the tessellation mesh is performed using the bisection method and a local representation of the phase-space sheet at second order relying on additional tracers created when needed at runtime. In order to preserve in the best way the Hamiltonian nature of the system, refinement is anisotropic and constrained by measurements of local Poincar\'e invariants. Resolution of Poisson equation is performed using the fast Fourier method on a regular rectangular grid, similarly to particle in cells codes. To compute the density projected onto this grid, the intersection of the tessellation and the grid is calculated using the method of Franklin and Kankanhalli (1993) generalised to linear order. As preliminary tests of the code, we study in four dimensional phase-space the evolution of an initially small patch in a chaotic potential and the cosmological collapse of a fluctuation composed of two sinusoidal waves. We also perform a "warm" dark matter simulation in six-dimensional phase-space that we use to check the parallel scaling of the code.Comment: Code and illustration movies available at: http://www.vlasix.org/index.php?n=Main.ColDICE - Article submitted to Journal of Computational Physic

arXiv.org e-Print Archive

HAL-INSU

A polymorphic reconfigurable emulator for parallel simulation

Author: Cook G.
Mcvey E. S.
Parrish E. A., Jr.
Publication venue
Publication date
Field of study

Microprocessor and arithmetic support chip technology was applied to the design of a reconfigurable emulator for real time flight simulation. The system developed consists of master control system to perform all man machine interactions and to configure the hardware to emulate a given aircraft, and numerous slave compute modules (SCM) which comprise the parallel computational units. It is shown that all parts of the state equations can be worked on simultaneously but that the algebraic equations cannot (unless they are slowly varying). Attempts to obtain algorithms that will allow parellel updates are reported. The word length and step size to be used in the SCM's is determined and the architecture of the hardware and software is described

NASA Technical Reports Server

MRRR-based Eigensolvers for Multi-core Processors and Supercomputers

Author: Petschow Matthias
Publication venue
Publication date: 01/01/2013
Field of study

The real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; it arises frequently as part of eigensolvers for standard and generalized dense Hermitian eigenproblems that are based on a reduction to tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations (MRRR or MR3 in short) - introduced in the late 1990s - is among the fastest methods. To compute k eigenpairs of a real n-by-n tridiagonal T, MRRR only requires O(kn) arithmetic operations; in contrast, all the other practical methods require O(k^2 n) or O(n^3) operations in the worst case. This thesis centers around the performance and accuracy of MRRR.Comment: PhD thesi

arXiv.org e-Print Archive

Publikationsserver der RWTH Aachen University

Recommended from our members

Analysing and bounding numerical error in spiking neural network simulations

Author: Turner James Paul
Publication venue
Publication date: 06/03/2020
Field of study

This study explores how numerical error occurs in simulations of spiking neural network models, and also how this error propagates through the simulation, changing its observed behaviour. The issue of non-reproducibility in parallel spiking neural network simulations is illustrated, and a method to bound all possible trajectories is discussed. The base method used in this study is known as mixed interval and affine arithmetic (mixed IA/AA), but some extra modifications are made to improve the tightness of the error bounds. I introduce Arpra, my new software, which is an arbitrary precision range analysis library, based on the GNU MPFR library. It improves on other implementations by enabling computations in custom floating-point precisions, and reduces the overhead rounding error of mixed IA/AA by computing in extended precision internally. It also implements a new error trimming technique, which reduces the error term whilst preserving correct boundaries. Arpra also implements deviation term condensing functions, which can reduce the number of floating-point operations per function significantly. Arpra is tested by simulating the Hénon map dynamical system, and found to produce tighter ranges than those of INTLAB, an alternative mixed IA/AA implementation. Arpra is used to bound the trajectories of fan-in spiking neural network simulations. Despite performing better than interval arithmetic, the mixed IA/AA method used by Arpra is shown to be inadequate for bounding the simulation trajectories, due to the highly nonlinear nature of spiking neural networks. A stability analysis of the neural network model is performed, and it is found that error boundaries are moderately tight in non-spiking regions of state space, where linear dynamics dominate, but error boundaries explode in spiking regions of state space, where nonlinear dynamics dominate

Sussex Research Online