An 826 MOPS, 210 uW/MHz Unum ALU in 65 nm
To overcome the limitations of conventional floating-point number formats, an
interval arithmetic and variable-width storage format called universal number
(unum) has been recently introduced. This paper presents the first (to the best
of our knowledge) silicon implementation measurements of an
application-specific integrated circuit (ASIC) for unum floating-point
arithmetic. The designed chip includes a 128-bit wide unum arithmetic unit to
execute additions and subtractions, while also supporting lossless (for
intermediate results) and lossy (for external data movements) compression units
to exploit the memory usage reduction potential of the unum format. Our chip,
fabricated in a 65 nm CMOS process, achieves a maximum clock frequency of 413
MHz at 1.2 V with an average measured power of 210 uW/MHz.
Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients
We present a robust and scalable preconditioner for the solution of
large-scale linear systems that arise from the discretization of elliptic PDEs
amenable to rank compression. The preconditioner is based on hierarchical
low-rank approximations and the cyclic reduction method. The setup and
application phases of the preconditioner achieve log-linear complexity in
memory footprint and number of operations, and numerical experiments exhibit
good weak and strong scalability at large processor counts in a distributed
memory environment. Numerical experiments with linear systems that feature
symmetry and nonsymmetry, definiteness and indefiniteness, constant and
variable coefficients demonstrate the preconditioner's applicability and
robustness. Furthermore, it is possible to control the number of iterations via
the accuracy threshold of the hierarchical matrix approximations and their
arithmetic operations, and the tuning of the admissibility condition parameter.
Together, these parameters allow for optimization of the memory requirements
and performance of the preconditioner.
Comment: 24 pages, Elsevier Journal of Computational and Applied Mathematics,
Dec 201
Approximate Computing Techniques for Low Power and Energy Efficiency
Approximate computing is an emerging computation paradigm in the era of the Internet of Things, big data and AI. It takes advantage of the error tolerance of many applications, such as machine learning and image/signal processing, to reduce resource consumption while still delivering a certain level of computation quality. In this dissertation, we propose several data-format-oriented approximate computing techniques that dramatically increase power/energy efficiency with an insignificant loss of computational quality.
For integer computations, we propose an approximate integer format (AIF) and its associated arithmetic mechanism with controllable computation accuracy. In AIF, operands are segmented at runtime so that the computation is performed only on part of each operand, by computing units (such as adders and multipliers) of smaller bit-width. The proposed AIF can be used for any arithmetic operation and can be extended to fixed-point numbers.
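The idea of computing on runtime-selected operand segments can be illustrated with a minimal sketch. This is not the dissertation's AIF, whose encoding and hardware are not described in the abstract; the helper names `segment` and `approx_mul`, the choice of keeping the most significant bits, and the 8-bit default width are all assumptions made for illustration.

```python
def segment(x, width):
    """Keep the `width` most significant bits of a non-negative integer.

    Returns (segment, shift) so that segment << shift approximates x.
    Hypothetical helper, not part of any published AIF definition.
    """
    shift = max(x.bit_length() - width, 0)
    return x >> shift, shift


def approx_mul(a, b, width=8):
    """Multiply two non-negative integers using only `width`-bit segments.

    The narrow product stands in for a smaller hardware multiplier;
    the shifts restore the magnitude of the result.
    """
    seg_a, shift_a = segment(a, width)
    seg_b, shift_b = segment(b, width)
    return (seg_a * seg_b) << (shift_a + shift_b)


exact = 1001 * 1003
approx = approx_mul(1001, 1003, width=8)
relative_error = (exact - approx) / exact
```

Because the dropped low-order bits are simply truncated, this sketch never overestimates the product, and widening `width` trades multiplier size for accuracy.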
AIF requires additional customized hardware support. We also provide a method that optimizes the bit-width of fixed-point computations running on general-purpose hardware. Traditional bit-width optimization methods mainly focus on minimizing the fraction part, since the integer part is restricted by the data range. In our work, we use the dynamic fixed-point concept and the input data range as prior knowledge to remove this limitation. We expand the computations into a data flow graph (DFG) and propose a novel approach to estimate the error during propagation. We derive the energy consumption function and apply a more efficient optimization strategy to balance the tradeoff between accuracy and energy.
Next, to deal with floating-point computation, we propose a runtime estimation technique that converts data into the logarithmic domain to assess the intermediate result at every node in the data flow graph. We then evaluate the impact of each node on the overall computation quality and decide whether to perform an accurate computation or simply use the estimated value. To approximate the whole graph, we propose three algorithms that decide, at certain nodes, whether those nodes can be truncated.
Beyond low power and energy efficiency, we propose a design concept that uses approximate computing to address security concerns. We can encode the secret keys into the least significant bits of the input data, and decode the final output. In future work, the input-output pairs will be used for device authentication, verification, and fingerprint
Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices
A task-based formulation of Scalable Universal Matrix Multiplication
Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is
applied to the multiplication of hierarchy-free, rank-structured matrices that
appear in the domain of quantum chemistry (QC). The novel features of our
formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and
(2) fine-grained task-based composition. These features make it tolerant of the
load imbalance due to the irregular matrix structure and eliminate all
artifactual sources of global synchronization. Scalability of iterative
computation of square-root inverse of block-rank-sparse QC matrices is
demonstrated; for full-rank (dense) matrices the performance of our SUMMA
formulation usually exceeds that of the state-of-the-art dense MM
implementations (ScaLAPACK and Cyclops Tensor Framework).
Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text
overlap with arXiv:1504.0504
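For readers unfamiliar with SUMMA, the following serial sketch shows the panel-wise update structure that the algorithm distributes across a process grid. It is not the task-based formulation described in the abstract: there is no communication, no task scheduling and no rank-structured blocks, and the function name `summa_serial` and the `panel` parameter are assumptions chosen for illustration.

```python
def summa_serial(a, b, panel=2):
    """Compute C = A * B as a sum of panel-wise updates.

    Serial stand-in for SUMMA: each loop iteration corresponds to one
    SUMMA step, in which a column panel of A and a row panel of B would
    be broadcast across the process grid before the local update.
    """
    n, inner, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for k0 in range(0, inner, panel):
        k1 = min(k0 + panel, inner)
        # One SUMMA iteration: C += A[:, k0:k1] * B[k0:k1, :]
        for i in range(n):
            for j in range(m):
                c[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
c = summa_serial(a, b, panel=2)
```

Each iteration only adds an independent contribution to C, which is why several iterations can in principle be scheduled concurrently, as the abstract describes.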
Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations
Although double-precision floating-point arithmetic currently dominates
high-performance computing, there is increasing interest in smaller and simpler
arithmetic types. The main reasons are potential improvements in energy
efficiency and memory footprint and bandwidth. However, simply switching to
lower-precision types typically results in increased numerical errors. We
investigate approaches to improving the accuracy of reduced-precision
fixed-point arithmetic types, using examples in an important domain for
numerical computation in neuroscience: the solution of Ordinary Differential
Equations (ODEs). The Izhikevich neuron model is used to demonstrate that
rounding has an important role in producing accurate spike timings from
explicit ODE solution algorithms. In particular, fixed-point arithmetic with
stochastic rounding consistently results in smaller errors compared to single
precision floating-point and fixed-point arithmetic with round-to-nearest
across a range of neuron behaviours and ODE solvers. A computationally much
cheaper alternative is also investigated, inspired by the concept of dither,
a widely understood mechanism for providing resolution below the least
significant bit (LSB) in digital signal processing. These results will have
implications for the solution of ODEs in other subject areas, and should also
be directly relevant to the huge range of practical problems that are
represented by Partial Differential Equations (PDEs).
Comment: Submitted to Philosophical Transactions of the Royal Society
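A minimal sketch can show why the rounding mode matters when many small increments are accumulated in fixed point, as in an explicit ODE solver. This is not the paper's implementation or its fixed-point format: the 8 fractional bits, the function names, and the toy accumulation loop are assumptions chosen for illustration.

```python
import math
import random

FRAC_BITS = 8  # assumed format: 8 fractional bits, so the LSB is 1/256


def round_nearest(x, frac_bits=FRAC_BITS):
    """Round x to the nearest representable fixed-point value."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale


def round_stochastic(x, rng, frac_bits=FRAC_BITS):
    """Round x up with probability equal to the discarded fraction."""
    scale = 1 << frac_bits
    scaled = x * scale
    low = math.floor(scaled)
    frac = scaled - low
    return (low + (1 if rng.random() < frac else 0)) / scale


def accumulate(step, n, rounder):
    """Add `step` n times, rounding to fixed point after every addition."""
    total = 0.0
    for _ in range(n):
        total = rounder(total + step)
    return total


rng = random.Random(0)  # seeded so the run is repeatable
nearest_sum = accumulate(0.001, 1000, round_nearest)
stochastic_sum = accumulate(0.001, 1000, lambda v: round_stochastic(v, rng))
```

With a step smaller than half an LSB, round-to-nearest discards every increment and the sum never leaves zero, whereas stochastic rounding is unbiased and the sum stays close to the exact value of 1.0 on average.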