18,565 research outputs found

    An 826 MOPS, 210 uW/MHz Unum ALU in 65 nm

    To overcome the limitations of conventional floating-point number formats, an interval arithmetic and variable-width storage format called universal number (unum) has been recently introduced. This paper presents the first (to the best of our knowledge) silicon implementation measurements of an application-specific integrated circuit (ASIC) for unum floating-point arithmetic. The designed chip includes a 128-bit wide unum arithmetic unit to execute additions and subtractions, while also supporting lossless (for intermediate results) and lossy (for external data movements) compression units to exploit the memory usage reduction potential of the unum format. Our chip, fabricated in a 65 nm CMOS process, achieves a maximum clock frequency of 413 MHz at 1.2 V with an average measured power of 210 uW/MHz

    Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients

    We present a robust and scalable preconditioner for the solution of large-scale linear systems that arise from the discretization of elliptic PDEs amenable to rank compression. The preconditioner is based on hierarchical low-rank approximations and the cyclic reduction method. The setup and application phases of the preconditioner achieve log-linear complexity in memory footprint and number of operations, and numerical experiments exhibit good weak and strong scalability at large processor counts in a distributed memory environment. Numerical experiments with linear systems that feature symmetry and nonsymmetry, definiteness and indefiniteness, constant and variable coefficients demonstrate the preconditioner's applicability and robustness. Furthermore, it is possible to control the number of iterations via the accuracy threshold of the hierarchical matrix approximations and their arithmetic operations, and the tuning of the admissibility condition parameter. Together, these parameters allow for optimization of the memory requirements and performance of the preconditioner.
    Comment: 24 pages, Elsevier Journal of Computational and Applied Mathematics, Dec 201

    Approximate Computing Techniques for Low Power and Energy Efficiency

    Approximate computing is an emerging computation paradigm in the era of the Internet of Things, big data, and AI. It takes advantage of the error tolerance of many applications, such as machine learning and image/signal processing, to reduce resource consumption while still delivering a certain level of computation quality. In this dissertation, we propose several data-format-oriented approximate computing techniques that dramatically increase power/energy efficiency with an insignificant loss of computational quality. For integer computations, we propose an approximate integer format (AIF) and its associated arithmetic mechanism with controllable computation accuracy. In AIF, operands are segmented at runtime such that the computation is performed on only part of the operands by computing units (such as adders and multipliers) of smaller bit-width. The proposed AIF can be used for any arithmetic operation and can be extended to fixed-point numbers. AIF requires additional customized hardware support. We also provide a method that optimizes the bit-width of fixed-point computations that run on general-purpose hardware. Traditional bit-width optimization methods mainly focus on minimizing the fraction part, since the integer part is restricted by the data range. In our work, we use the dynamic fixed-point concept and the input data range as prior knowledge to remove this limitation. We expand the computations into a data flow graph (DFG) and propose a novel approach to estimate the error during propagation. We derive the energy consumption function and apply a more efficient optimization strategy to balance the tradeoff between accuracy and energy. Next, to deal with floating-point computation, we propose a runtime estimation technique that converts data into the logarithmic domain to assess the intermediate result at every node in the data flow graph. Then we evaluate the impact of each node on the overall computation quality and decide whether to perform an accurate computation or simply use the estimated value. To approximate the whole graph, we propose three algorithms that decide whether certain nodes can be truncated. Besides low power and energy efficiency, we propose a design concept that uses approximate computing to address security concerns. We can encode the secret keys into the least significant bits of the input data, and decode the final output. In future work, the input-output pairs will be used for device authentication, verification, and fingerprint
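The abstract above describes segmenting operands at runtime so that narrower arithmetic units can be used. The sketch below illustrates that general idea for non-negative integer addition only. It is not the dissertation's AIF: the function name `approx_add`, the `seg_bits` parameter, and the choice to align both operands on the leading bit of the larger one are all assumptions made for illustration.

```python
def approx_add(a, b, seg_bits=8):
    """Approximate a + b using only the top seg_bits-wide segment of the operands.

    Hypothetical illustration: both operands are shifted right by the same
    amount so that the larger one fits in seg_bits bits, the narrow segments
    are added, and the sum is shifted back. Low-order bits are discarded.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")
    # Number of low-order bits dropped so the larger operand fits in seg_bits.
    shift = max(max(a.bit_length(), b.bit_length()) - seg_bits, 0)
    # The addition itself only ever sees seg_bits-wide inputs.
    return ((a >> shift) + (b >> shift)) << shift


exact = 1003 + 25
approx = approx_add(1003, 25, seg_bits=8)
print(exact, approx, exact - approx)
```

Each operand loses less than `2**shift` to truncation, so in this sketch the result never exceeds the exact sum and the error stays below `2 * 2**shift`. Operands that already fit in `seg_bits` bits are added exactly.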

    Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices

    A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).
    Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text overlap with arXiv:1504.0504
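For orientation, SUMMA computes C = A·B as a sum of panel products: at each iteration a column panel of A and the matching row panel of B are combined and accumulated into C. The sketch below is a serial, pure-Python rendering of that loop structure only. It does not reproduce the paper's task-based, concurrent formulation or any distributed broadcast, and the function name `summa_matmul` and the `panel` parameter are assumptions made for illustration.

```python
def summa_matmul(A, B, panel=2):
    """Serial sketch of the SUMMA loop structure on lists of lists.

    C is accumulated as a sum over k-panels of A[:, k0:k1] @ B[k0:k1, :].
    In a distributed SUMMA each panel would be broadcast across a process
    grid; here every step runs locally, one panel after another.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    if any(len(row) != k_dim for row in A):
        raise ValueError("inner dimensions of A and B do not match")
    C = [[0.0] * m for _ in range(n)]
    for k0 in range(0, k_dim, panel):
        k1 = min(k0 + panel, k_dim)  # the last panel may be narrower
        for i in range(n):
            for j in range(m):
                # Rank-(k1 - k0) update of C from the current pair of panels.
                C[i][j] += sum(A[i][k] * B[k][j] for k in range(k0, k1))
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(summa_matmul(A, B))  # [[58.0, 64.0], [139.0, 154.0]]
```

Because the panel updates are independent except for the accumulation into C, several iterations can in principle be in flight at once, which is the property the abstract's concurrent scheduling exploits.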

    Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations

    Although double-precision floating-point arithmetic currently dominates high-performance computing, there is increasing interest in smaller and simpler arithmetic types. The main reasons are potential improvements in energy efficiency and memory footprint and bandwidth. However, simply switching to lower-precision types typically results in increased numerical errors. We investigate approaches to improving the accuracy of reduced-precision fixed-point arithmetic types, using examples in an important domain for numerical computation in neuroscience: the solution of Ordinary Differential Equations (ODEs). The Izhikevich neuron model is used to demonstrate that rounding has an important role in producing accurate spike timings from explicit ODE solution algorithms. In particular, fixed-point arithmetic with stochastic rounding consistently results in smaller errors compared to single precision floating-point and fixed-point arithmetic with round-to-nearest across a range of neuron behaviours and ODE solvers. A computationally much cheaper alternative is also investigated, inspired by the concept of dither that is a widely understood mechanism for providing resolution below the least significant bit (LSB) in digital signal processing. These results will have implications for the solution of ODEs in other subject areas, and should also be directly relevant to the huge range of practical problems that are represented by Partial Differential Equations (PDEs).
    Comment: Submitted to Philosophical Transactions of the Royal Society
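To make the contrast between round-to-nearest and stochastic rounding concrete, the sketch below quantizes values to a fixed-point grid and accumulates an increment smaller than one LSB, a situation that arises when an explicit ODE solver takes small steps. It is a minimal illustration, not the paper's implementation: the helper names, the 8 fractional bits, and the step size are assumptions chosen for the example.

```python
import math
import random


def to_fixed_nearest(x, frac_bits):
    """Quantize x to an integer count of 2**-frac_bits using round-half-up."""
    return math.floor(x * (1 << frac_bits) + 0.5)


def to_fixed_stochastic(x, frac_bits, rng=random):
    """Quantize x, rounding up with probability equal to the discarded fraction."""
    scaled = x * (1 << frac_bits)
    lower = math.floor(scaled)
    return lower + (1 if rng.random() < scaled - lower else 0)


def accumulate(step, n, frac_bits, quantize):
    """Add the quantized step n times in fixed point and return the real value."""
    acc = 0  # integer accumulator in units of 2**-frac_bits
    for _ in range(n):
        acc += quantize(step, frac_bits)
    return acc / (1 << frac_bits)


rng = random.Random(0)  # seeded so the run is repeatable
step, n, frac_bits = 0.001, 10000, 8  # step is below one LSB (1/256)
nearest = accumulate(step, n, frac_bits, to_fixed_nearest)
stochastic = accumulate(
    step, n, frac_bits, lambda x, f: to_fixed_stochastic(x, f, rng)
)
print("exact:", step * n, "nearest:", nearest, "stochastic:", stochastic)
```

With round-to-nearest every sub-LSB increment rounds to zero and the accumulator never moves, whereas stochastic rounding is unbiased in expectation, so the accumulated value stays close to the exact sum.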