Search CORE

18,565 research outputs found

An 826 MOPS, 210 uW/MHz Unum ALU in 65 nm

Author: Benini Luca
Glaser Florian
Gürkaynak Frank K.
Huang Qiuting
Mach Stefan
Rahimi Abbas
Publication venue
Publication date: 04/12/2017
Field of study

To overcome the limitations of conventional floating-point number formats, an interval arithmetic and variable-width storage format called universal number (unum) has been recently introduced. This paper presents the first (to the best of our knowledge) silicon implementation measurements of an application-specific integrated circuit (ASIC) for unum floating-point arithmetic. The designed chip includes a 128-bit wide unum arithmetic unit to execute additions and subtractions, while also supporting lossless (for intermediate results) and lossy (for external data movements) compression units to exploit the memory usage reduction potential of the unum format. Our chip, fabricated in a 65 nm CMOS process, achieves a maximum clock frequency of 413 MHz at 1.2 V with an average measured power of 210 uW/MHz

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients

Author: Chávez Gustavo
Keyes David
Turkiyyah George
Zampini Stefano
Publication venue: 'Elsevier BV'
Publication date: 23/12/2017
Field of study

We present a robust and scalable preconditioner for the solution of large-scale linear systems that arise from the discretization of elliptic PDEs amenable to rank compression. The preconditioner is based on hierarchical low-rank approximations and the cyclic reduction method. The setup and application phases of the preconditioner achieve log-linear complexity in memory footprint and number of operations, and numerical experiments exhibit good weak and strong scalability at large processor counts in a distributed memory environment. Numerical experiments with linear systems that feature symmetry and nonsymmetry, definiteness and indefiniteness, constant and variable coefficients demonstrate the preconditioner applicability and robustness. Furthermore, it is possible to control the number of iterations via the accuracy threshold of the hierarchical matrix approximations and their arithmetic operations, and the tuning of the admissibility condition parameter. Together, these parameters allow for optimization of the memory requirements and performance of the preconditioner.Comment: 24 pages, Elsevier Journal of Computational and Applied Mathematics, Dec 201

arXiv.org e-Print Archive

eScholarship - University of California

Approximate Computing Techniques for Low Power and Energy Efficiency

Author: Gao Mingze
Publication venue
Publication date: 01/01/2018
Field of study

Approximate computing is an emerging computation paradigm in the era of the Internet of things, big data and AI. It takes advantages of the error-tolerable feature of many applications, such as machine learning and image/signal processing, to reduce the resources consumption and delivers a certain level of computation quality. In this dissertation, we propose several data format oriented approximate computing techniques that will dramatically increase the power/energy efficiency with the insignificant loss of computational quality. For the integer computations, we propose an approximate integer format (AIF) and its associated arithmetic mechanism with controllable computation accuracy. In AIF, operands are segmented at runtime such that the computation is performed only on part of operands by computing units (such as adders and multipliers) of smaller bit-width. The proposed AIF can be used for any arithmetic operation and can be extended to fixed point numbers. AIF requires additional customized hardware support. We also provide a method that can optimize the bit-width of the fixed point computations that run on the general purpose hardware. The traditional bit-width optimization methods mainly focus on minimizing the fraction part since the integer part is restricted by the data range. In our work, we utilize the dynamic fixed point concept and the input data range as the prior knowledge to get rid of this limitation. We expand the computations into data flow graph (DFG) and propose a novel approach to estimate the error during propagation. We derive the function of energy consumption and apply a more efficient optimization strategy to balance the tradeoff between the accuracy and energy. Next, to deal with the floating point computation, we propose a runtime estimation technique by converting data into the logarithmic domain to assess the intermediate result at every node in the data flow graph. Then we evaluate the impact of each node to the overall computation quality, and decide whether we should perform an accurate computation or simply use the estimated value. To approximate the whole graph, we propose three algorithms to make the decisions at certain nodes whether these nodes can be truncated. Besides the low power and energy efficiency concern, we propose a design concept that utilizes the approximate computing to address the security concerns. We can encode the secret keys into the least significant bits of the input data, and decode the final output. In the future work, the input-output pairs will be used for device authentication, verification, and fingerprint

Digital Repository at the University of Maryland

Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices

Author: Baruch E.
Cannon L. E.
Choi J
Choi J.
Choi J.
Solomonik E.
Szabo A.
van de Geijn R. A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 09/10/2015
Field of study

A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization.Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text overlap with arXiv:1504.0504

arXiv.org e-Print Archive

Crossref

Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations

Author: Furber Steve
Hopkins Michael
Lester Dave R.
Mikaitis Mantas
Publication venue: 'The Royal Society'
Publication date: 01/01/2020
Field of study

Although double-precision floating-point arithmetic currently dominates high-performance computing, there is increasing interest in smaller and simpler arithmetic types. The main reasons are potential improvements in energy efficiency and memory footprint and bandwidth. However, simply switching to lower-precision types typically results in increased numerical errors. We investigate approaches to improving the accuracy of reduced-precision fixed-point arithmetic types, using examples in an important domain for numerical computation in neuroscience: the solution of Ordinary Differential Equations (ODEs). The Izhikevich neuron model is used to demonstrate that rounding has an important role in producing accurate spike timings from explicit ODE solution algorithms. In particular, fixed-point arithmetic with stochastic rounding consistently results in smaller errors compared to single precision floating-point and fixed-point arithmetic with round-to-nearest across a range of neuron behaviours and ODE solvers. A computationally much cheaper alternative is also investigated, inspired by the concept of dither that is a widely understood mechanism for providing resolution below the least significant bit (LSB) in digital signal processing. These results will have implications for the solution of ODEs in other subject areas, and should also be directly relevant to the huge range of practical problems that are represented by Partial Differential Equations (PDEs).Comment: Submitted to Philosophical Transactions of the Royal Society

arXiv.org e-Print Archive

The University of Manchester - Institutional Repository