1,048 research outputs found
LOCOFloat: A low-cost floating-point format for FPGAs: Application to HIL simulators
One of the main decisions when making a digital design is which arithmetic is going
to be used. The arithmetic determines the hardware resources needed and the latency of every
operation. This is especially important in real-time applications like HIL (Hardware-in-the-loop),
where a real-time simulation of a plant (a power converter, mechanical system, or any other complex
system) is carried out. While fixed-point arithmetic yields optimal implementations, using considerably
fewer resources and allowing smaller simulation steps, its use is restricted to very specific
applications, as its design effort is quite high. On the other hand, IEEE-754 floating-point may
have resolution problems in the case of the 32-bit version, and excessive hardware usage in the case of the
64-bit version. This paper presents LOCOFloat, a low-cost floating-point format designed for FPGA
applications. Its key features are soft normalization of the results and significand and exponent
fields in two's complement. This paper shows the implementation of addition, subtraction and
multiplication in the proposed format. Both IEEE-754 versions and LOCOFloat are compared in
this paper by implementing a HIL model of a buck converter. Although the application example is a
HIL simulator, other applications could benefit from the proposed format. Results show that
LOCOFloat is as accurate as 64-bit floating-point, while reducing the use of DSP blocks by 84%
2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation
We report on improvements made over the past two decades to our adaptive
treecode N-body method (HOT). A mathematical and computational approach to the
cosmological N-body problem is described, with performance and scalability
measured up to 256k processors. We present error analysis and
scientific application results from a series of more than ten 69 billion
particle cosmological simulations, accounting for
floating point operations. These results include the first simulations using
the new constraints on the standard model of cosmology from the Planck
satellite. Our simulations set a new standard for accuracy and scientific
throughput, while meeting or exceeding the computational efficiency of the
latest generation of hybrid TreePM N-body methods. Comment: 12 pages, 8 figures, 77 references; To appear in Proceedings of SC
'1
THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT
A 16-bit floating point (FP) Arithmetic Logic Unit (ALU) was designed and implemented in 0.35 µm CMOS technology. Typical uses of the 16-bit FP ALU include graphics processors and embedded multimedia applications. The ALUs of modern microprocessors use a fused multiply-add (FMA) design technique. An advantage of the FMA is that it removes the need for a comparator, which is required for a normal FP adder. The FMA consists of a multiplier, shifters, adders and a rounding circuit. A fast multiplier based on the Wallace tree configuration was designed. The number of partial products was greatly reduced by the use of the modified Booth encoder. The Wallace tree was chosen to reduce the number of reduction layers of partial products. The multiplier also involved the design of a pass-transistor-based 4:2 compressor. The average delay of the pass-transistor-based compressor was 55 ps and was found to be 7 times faster than the full-adder-based 4:2 compressor. The shifters consist of separate left and right shifters using multiplexers. The shift amount is calculated using the exponents of the three operands. The addition operation is implemented using a carry skip adder (CSK). The average delay of the CSK was 1.05 ns, slower than the carry look-ahead adder by about 400 ps. The advantages of the CSK are reduced power, gate count and area when compared to a similarly sized carry look-ahead adder. The adder computes the sum of the multiplier result and the shifted value of the addend. In most modern computers, division is performed in software, thereby eliminating the need for a separate hardware unit. The FMA hardware unit was used to perform FP division. The FP divider uses the Newton-Raphson algorithm to solve division by iteration. The initial approximated value with five-bit accuracy was assumed to be pre-stored in cache memory, and a separate clock cycle for the cache read was assumed before the start of the FP division operation.
In order to significantly reduce the area of the design, only one multiplier was used. The round-to-nearest technique was implemented using an 11-bit variable CSK adder. This is the best rounding technique when compared to other rounding techniques. In both the FMA and division, rounding was performed after the computation of the final result during the last clock cycle of operation. Testability analysis is performed for the multiplier, which is the most complex and critical part of the FP ALU. The specific aim of testability was to ensure the correct operation of the multiplier and thus guarantee the correctness of the FMA circuit at the layout stage. The multiplier's output was tested by identifying the minimal number of input vectors which toggle the inputs of the 4:2 compressors of the multiplier. The test vectors were identified in a semi-automated manner using the Perl scripting language. The multiplier was tested with a test set of thirty-one vectors. The fault coverage of the multiplier was found to be 90.09%. The layout was implemented using IC Station of the Mentor Graphics CAD tool and resulted in a chip area of 1.96 mm². The specifications for basic arithmetic operations were met successfully. The FP division operation was completed within six clock cycles. The other arithmetic operations (FMA, FP addition, FP subtraction and FP multiplication) were completed within three clock cycles
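The Newton-Raphson division mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration of the general technique, not the thesis's circuit: the function names, the way the five-bit seed is produced (`seed_5bit` stands in for the pre-stored lookup value), and the iteration count are all assumptions, and the arithmetic runs in double precision rather than in a 16-bit format. Each iteration is written as two multiply-add steps, which is the shape an FMA unit would evaluate.

```python
import math


def seed_5bit(d):
    """Hypothetical stand-in for a lookup table: 1/d kept to about 5 bits."""
    m, e = math.frexp(1.0 / d)           # 1/d = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 32) / 32, e)


def reciprocal_newton(d, x0, iterations=3):
    """Refine an estimate x0 of 1/d using x <- x + x * (1 - d * x)."""
    x = x0
    for _ in range(iterations):
        residual = 1.0 - d * x           # first multiply-add
        x = x + x * residual             # second multiply-add
    return x


def divide(a, d, iterations=3):
    """Approximate a / d as a * (1/d); d must be nonzero."""
    return a * reciprocal_newton(d, seed_5bit(d), iterations)


if __name__ == "__main__":
    print(divide(1.0, 3.0))
```

Because the relative error is squared on every pass, a seed that is good to roughly five bits reaches roughly ten bits after one iteration and twenty after two, which is why a short, fixed number of iterations is enough for a narrow format.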
Implementation and Synthesis of Math Library Functions
Achieving speed and accuracy for math library functions like exp, sin, and
log is difficult. This is because low-level implementation languages like C do
not help math library developers catch mathematical errors, build
implementations incrementally, or separate high-level and low-level decision
making. This ultimately puts development of such functions out of reach for all
but the most experienced experts. To address this, we introduce MegaLibm, a
domain-specific language for implementing, testing, and tuning math library
implementations. MegaLibm is safe, modular, and tunable. Implementations in
MegaLibm can automatically detect mathematical mistakes like sign flips via
semantic wellformedness checks, and components like range reductions can be
implemented in a modular, composable way, simplifying implementations. Once the
high-level algorithm is done, tuning parameters like working precisions and
evaluation schemes can be adjusted through orthogonal tuning parameters to
achieve the desired speed and accuracy. MegaLibm also enables math library
developers to work interactively, compiling, testing, and tuning their
implementations and invoking tools like Sollya and type-directed synthesis to
complete components and synthesize entire implementations. MegaLibm can express
8 state-of-the-art math library implementations with comparable speed and
accuracy to the original C code, and can synthesize 5 variations and 3
from-scratch implementations with minimal guidance. Comment: 25 pages, 12 figures
An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System
Training machine learning (ML) algorithms is a computationally intensive
process, which is frequently memory-bound due to repeatedly accessing large
training datasets. As a result, processor-centric systems (e.g., CPU, GPU)
suffer from costly data movement between memory units and processing units,
which consumes large amounts of energy and execution cycles. Memory-centric
computing systems, i.e., with processing-in-memory (PIM) capabilities, can
alleviate this data movement bottleneck.
Our goal is to understand the potential of modern general-purpose PIM
architectures to accelerate ML training. To do so, we (1) implement several
representative classic ML algorithms (namely, linear regression, logistic
regression, decision tree, K-Means clustering) on a real-world general-purpose
PIM architecture, (2) rigorously evaluate and characterize them in terms of
accuracy, performance and scaling, and (3) compare to their counterpart
implementations on CPU and GPU. Our evaluation on a real memory-centric
computing system with more than 2500 PIM cores shows that general-purpose PIM
architectures can greatly accelerate memory-bound ML workloads, when the
necessary operations and datatypes are natively supported by PIM hardware. For
example, our PIM implementation of decision tree is faster than a
state-of-the-art CPU version on an 8-core Intel Xeon, and faster
than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering
on PIM is and than state-of-the-art CPU and GPU
versions, respectively.
To our knowledge, our work is the first one to evaluate ML training on a
real-world PIM architecture. We conclude with key observations, takeaways, and
recommendations that can inspire users of ML workloads, programmers of PIM
architectures, and hardware designers & architects of future memory-centric
computing systems
The use of primitives in the calculation of radiative view factors
Compilations of radiative view factors (often in closed analytical form) are readily available in the open literature for commonly encountered geometries. For more complex three-dimensional (3D) scenarios, however, the effort required to solve the requisite multi-dimensional integrations needed to estimate a required view factor can be daunting to say the least. In such cases, a combination of finite element methods (where the geometry in question is sub-divided into a large number of uniform, often triangular, elements) and Monte Carlo Ray Tracing (MC-RT) has been developed, although frequently the software implementation is suitable only for a limited set of geometrical scenarios. Driven initially by a need to calculate the radiative heat transfer occurring within an operational fibre-drawing furnace, this research set out to examine options whereby MC-RT could be used to cost-effectively calculate any generic 3D radiative view factor using current vectorisation technologies
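To make the Monte Carlo ray tracing idea concrete, here is a minimal Python sketch for one simple case: two directly opposed parallel squares, both treated as diffuse surfaces. The geometry, the function name and all parameters are assumptions chosen for illustration; nothing here is taken from the software described in the abstract, and the loop is scalar rather than vectorised. Rays leave random points on the lower square with cosine-weighted directions, and the view factor estimate is the fraction that lands on the upper square.

```python
import math
import random


def view_factor_parallel_squares(n_rays=200_000, side=1.0, gap=1.0, seed=0):
    """Estimate F_12 between two aligned parallel squares a distance `gap` apart."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rays):
        # Uniform emission point on surface 1 (the plane z = 0).
        x0 = rng.uniform(0.0, side)
        y0 = rng.uniform(0.0, side)
        # Cosine-weighted direction for a diffuse emitter: sin(theta) = sqrt(u).
        u = rng.random()
        phi = 2.0 * math.pi * rng.random()
        sin_t = math.sqrt(u)
        cos_t = math.sqrt(1.0 - u)       # > 0 because u < 1
        # Path length to the plane z = gap, then the landing point on it.
        t = gap / cos_t
        x = x0 + t * sin_t * math.cos(phi)
        y = y0 + t * sin_t * math.sin(phi)
        if 0.0 <= x <= side and 0.0 <= y <= side:
            hits += 1
    return hits / n_rays


if __name__ == "__main__":
    print(view_factor_parallel_squares())
```

For unit squares one unit apart, the closed-form view factor is about 0.20, so the estimate can be checked against a known value before the same sampling scheme is applied to geometries that have no analytical solution.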
ColDICE: a parallel Vlasov-Poisson solver using moving adaptive simplicial tessellation
Numerically solving the Vlasov-Poisson equations for initially cold systems can
be reduced to following the evolution of a three-dimensional sheet evolving in
six-dimensional phase-space. We describe a public parallel numerical algorithm
that represents the phase-space sheet with a conforming,
self-adaptive simplicial tessellation whose vertices follow the
Lagrangian equations of motion. The algorithm is implemented both in six- and
four-dimensional phase-space. Refinement of the tessellation mesh is performed
using the bisection method and a local representation of the phase-space sheet
at second order relying on additional tracers created when needed at runtime.
To best preserve the Hamiltonian nature of the system,
refinement is anisotropic and constrained by measurements of local Poincaré
invariants. The Poisson equation is solved using the fast Fourier
method on a regular rectangular grid, similarly to particle-in-cell codes. To
compute the density projected onto this grid, the intersection of the
tessellation and the grid is calculated using the method of Franklin and
Kankanhalli (1993) generalised to linear order. As preliminary tests of the
code, we study in four dimensional phase-space the evolution of an initially
small patch in a chaotic potential and the cosmological collapse of a
fluctuation composed of two sinusoidal waves. We also perform a "warm" dark
matter simulation in six-dimensional phase-space that we use to check the
parallel scaling of the code. Comment: Code and illustration movies available at:
http://www.vlasix.org/index.php?n=Main.ColDICE - Article submitted to Journal
of Computational Physics
A polymorphic reconfigurable emulator for parallel simulation
Microprocessor and arithmetic support chip technology was applied to the design of a reconfigurable emulator for real-time flight simulation. The system developed consists of a master control system, which performs all man-machine interactions and configures the hardware to emulate a given aircraft, and numerous slave compute modules (SCMs), which comprise the parallel computational units. It is shown that all parts of the state equations can be worked on simultaneously but that the algebraic equations cannot (unless they are slowly varying). Attempts to obtain algorithms that will allow parallel updates are reported. The word length and step size to be used in the SCMs are determined, and the architecture of the hardware and software is described
MRRR-based Eigensolvers for Multi-core Processors and Supercomputers
The real symmetric tridiagonal eigenproblem is of outstanding importance in
numerical computations; it arises frequently as part of eigensolvers for
standard and generalized dense Hermitian eigenproblems that are based on a
reduction to tridiagonal form. For its solution, the algorithm of Multiple
Relatively Robust Representations (MRRR or MR3 in short) - introduced in the
late 1990s - is among the fastest methods. To compute k eigenpairs of a real
n-by-n tridiagonal T, MRRR only requires O(kn) arithmetic operations; in
contrast, all the other practical methods require O(k^2 n) or O(n^3) operations
in the worst case. This thesis centers around the performance and accuracy of
MRRR. Comment: PhD thesis
Analysing and bounding numerical error in spiking neural network simulations
This study explores how numerical error occurs in simulations of spiking neural network models, and also how this error propagates through the simulation, changing its observed behaviour. The issue of non-reproducibility in parallel spiking neural network simulations is illustrated, and a method to bound all possible trajectories is discussed. The base method used in this study is known as mixed interval and affine arithmetic (mixed IA/AA), but some extra modifications are made to improve the tightness of the error bounds.
I introduce Arpra, my new software, which is an arbitrary precision range analysis library, based on the GNU MPFR library. It improves on other implementations by enabling computations in custom floating-point precisions, and reduces the overhead rounding error of mixed IA/AA by computing in extended precision internally. It also implements a new error trimming technique, which reduces the error term whilst preserving correct boundaries. Arpra also implements deviation term condensing functions, which can reduce the number of floating-point operations per function significantly. Arpra is tested by simulating the Hénon map dynamical system, and found to produce tighter ranges than those of INTLAB, an alternative mixed IA/AA implementation.
Arpra is used to bound the trajectories of fan-in spiking neural network simulations. Despite performing better than interval arithmetic, the mixed IA/AA method used by Arpra is shown to be inadequate for bounding the simulation trajectories, due to the highly nonlinear nature of spiking neural networks. A stability analysis of the neural network model is performed, and it is found that error boundaries are moderately tight in non-spiking regions of state space, where linear dynamics dominate, but error boundaries explode in spiking regions of state space, where nonlinear dynamics dominate
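The behaviour described here, where error bounds widen quickly under nonlinear dynamics, can be seen even with plain interval arithmetic on the Hénon map. The Python sketch below is not Arpra and does not use its API or mixed IA/AA: the `Interval` class, the one-ulp outward widening via `math.nextafter` (Python 3.9 or later), the classic parameters a = 1.4 and b = 0.3, and the starting point are all assumptions made for illustration.

```python
import math


class Interval:
    """Closed interval [lo, hi]; each operation widens the result by one ulp."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def _outward(lo, hi):
        # Round-to-nearest is off by at most half an ulp, so one ulp is enough.
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._outward(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval._outward(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval._outward(min(p), max(p))

    def width(self):
        return self.hi - self.lo


def henon_enclosures(steps, a=1.4, b=0.3, x0=0.1, y0=0.1):
    """Iterate x' = 1 - a*x*x + y, y' = b*x on intervals; return the enclosures."""
    ia, ib, one = Interval(a), Interval(b), Interval(1.0)
    x, y = Interval(x0), Interval(y0)
    out = []
    for _ in range(steps):
        x, y = one - ia * x * x + y, ib * x
        out.append((x, y))
    return out


if __name__ == "__main__":
    for n, (x, _) in enumerate(henon_enclosures(20), start=1):
        print(n, x.width())
```

Printing the widths shows the enclosure of `x` starting near machine precision and growing by orders of magnitude within a few dozen steps. Part of that growth comes from treating `x * x` as a product of two independent intervals, the dependency problem that affine arithmetic is designed to reduce.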