Search CORE

167 research outputs found

Design of Energy-Efficient Approximate Arithmetic Circuits

Author: Shao Botang
Publication venue
Publication date: 05/02/2015
Field of study

Energy consumption has become one of the most critical design challenges in integrated circuit design. Arithmetic computing circuits, in particular array-based arithmetic computing circuits such as adders, multipliers, squarers, have been widely used. In many cases, array-based arithmetic computing circuits consume a significant amount of energy in a chip design. Hence, reduction of energy consumption of array-based arithmetic computing circuits is an important design consideration. To this end, designing low-power arithmetic circuits by intelligently trading off processing precision for energy saving in error-resilient applications such as DSP, machine learning and neuromorphic circuits provides a promising solution to the energy dissipation challenge of such systems. To solve the chip’s energy problem, especially for those applications with inherent error resilience, array-based approximate arithmetic computing (AAAC) circuits that produce errors while having improved energy efficiency have been proposed. Specifically, a number of approximate adders, multipliers and squarers have been presented in the literature. However, the chief limitation of these designs is their un-optimized processing accuracy, which is largely due to the current lack of systemic guidance for array-based AAAC circuit design pertaining to optimal tradeoffs between error, energy and area overhead. Therefore, in this research, our first contribution is to propose a general model for approximate array-based approximate arithmetic computing to guide the minimization of processing error. As part of this model, the Error Compensation Unit (ECU) is identified as a key building block for a wide range of AAAC circuits. We develop theoretical analysis geared towards addressing two critical design problems of the ECU, namely, determination of optimal error compensation values and identification of the optimal error compensation scheme. We demonstrate how this general AAAC model can be leveraged to derive practical design insights that may lead to optimal tradeoffs between accuracy, energy dissipation and area overhead. To further minimize energy consumption, delay and area of AAAC circuits, we perform ECU logic simplification by introducing don't cares. By applying the proposed model, we propose an approximate 16x16 fixed-width Booth multiplier that consumes 44.85% and 28.33% less energy and area compared with theoretically the most accurate fixed-width Booth multiplier when implemented using a 90nm CMOS standard cell library. Furthermore, it reduces average error, max error and mean square error by 11.11%, 28.11% and 25.00%, respectively, when compared with the best reported approximate Booth multiplier and outperforms the best reported approximate design significantly by 19.10% in terms of the energy-delay-mean square error product (EDE_(ms)). Using the same approach, significant energy consumption, area and error reduction is achieved for a squarer unit, with more than 20.00% EDE_(ms) reduction over existing fixed-width squarer designs. To further reduce error and cost by utilizing extra signatures and don't cares, we demonstrate a 16-bit fixed-width squarer that improves the energy-delay-max error (EDE_(max)) by 15.81%

Texas A&M Repository

Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

Author: Armeniakos Giorgos
Hanif Muhammad Abdullah
Jiao Xun
Leon Vasileios
Pekmestzi Kiamal
Shafique Muhammad
Soudris Dimitrios
Publication venue
Publication date: 20/07/2023
Field of study

The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus, typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field in the last 15 years has constituted power as a first-class design concern. As a result, the community of computing systems is forced to find alternative design approaches to facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted an ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at software, hardware, and architectural levels. Over the last decade, there is a plethora of approximation techniques in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing, and it reviews its motivation, terminology and principles, as well it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.Comment: Under Review at ACM Computing Survey

arXiv.org e-Print Archive

Customisable arithmetic hardware designs

Author: Cheung Chak-Chung Ray
Cheung Chak-Chung Ray
Publication venue
Publication date: 01/01/2007
Field of study

Imperial Users onl

Spiral - Imperial College Digital Repository

Serial-data computation in VLSI

Author: Smith Stewart Gresty
Publication venue: The University of Edinburgh
Publication date: 01/01/1987
Field of study

Edinburgh Research Archive

A high-speed integrated circuit with applications to RSA Cryptography

Author: Onions Paul David
Publication venue: 'University of Plymouth'
Publication date: 01/01/1995
Field of study

Merged with duplicate record 10026.1/833 on 01.02.2017 by CS (TIS)The rapid growth in the use of computers and networks in government, commercial and private communications systems has led to an increasing need for these systems to be secure against unauthorised access and eavesdropping. To this end, modern computer security systems employ public-key ciphers, of which probably the most well known is the RSA ciphersystem, to provide both secrecy and authentication facilities. The basic RSA cryptographic operation is a modular exponentiation where the modulus and exponent are integers typically greater than 500 bits long. Therefore, to obtain reasonable encryption rates using the RSA cipher requires that it be implemented in hardware. This thesis presents the design of a high-performance VLSI device, called the WHiSpER chip, that can perform the modular exponentiations required by the RSA cryptosystem for moduli and exponents up to 506 bits long. The design has an expected throughput in excess of 64kbit/s making it attractive for use both as a general RSA processor within the security function provider of a security system, and for direct use on moderate-speed public communication networks such as ISDN. The thesis investigates the low-level techniques used for implementing high-speed arithmetic hardware in general, and reviews the methods used by designers of existing modular multiplication/exponentiation circuits with respect to circuit speed and efficiency. A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic, together with an efficient multiplier architecture, are proposed that remove the speed bottleneck of previous designs. Finally, the implementation of the new algorithm and architecture within the WHiSpER chip is detailed, along with a discussion of the application of the chip to ciphering and key generation

Plymouth Electronic Archive and Research Library

OpenGrey Repository

High sample-rate Givens rotations for recursive least squares

Author: Walke Richard Lewis
Publication venue
Publication date
Field of study

The design of an application-specific integrated circuit of a parallel array processor is considered for recursive least squares by QR decomposition using Givens rotations, applicable in adaptive filtering and beamforming applications. Emphasis is on high sample-rate operation, which, for this recursive algorithm, means that the time to perform arithmetic operations is critical. The algorithm, architecture and arithmetic are considered in a single integrated design procedure to achieve optimum results. A realisation approach using standard arithmetic operators, add, multiply and divide is adopted. The design of high-throughput operators with low delay is addressed for fixed- and floating-point number formats, and the application of redundant arithmetic considered. New redundant multiplier architectures are presented enabling reductions in area of up to 25%, whilst maintaining low delay. A technique is presented enabling the use of a conventional tree multiplier in recursive applications, allowing savings in area and delay. Two new divider architectures are presented showing benefits compared with the radix-2 modified SRT algorithm. Givens rotation algorithms are examined to determine their suitability for VLSI implementation. A novel algorithm, based on the Squared Givens Rotation (SGR) algorithm, is developed enabling the sample-rate to be increased by a factor of approximately 6 and offering area reductions up to a factor of 2 over previous approaches. An estimated sample-rate of 136 MHz could be achieved using a standard cell approach and O.35pm CMOS technology. The enhanced SGR algorithm has been compared with a CORDIC approach and shown to benefit by a factor of 3 in area and over 11 in sample-rate. When compared with a recent implementation on a parallel array of general purpose (GP) DSP chips, it is estimated that a single application specific chip could offer up to 1,500 times the computation obtained from a single OP DSP chip

Warwick Research Archives Portal Repository

A full-custom digital-signal-processing unit for real-time cortical blood flow monitoring

Author: HONG ZHIQIAN
Publication venue
Publication date: 24/12/2009
Field of study

Master'sMASTER OF ENGINEERIN

ScholarBank@NUS

PARMESAN: Parallel ARithMEticS over ENcrypted data

Author: Jakub Klemsa
Melek Önen
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 26/04/2023
Field of study

Fully Homomorphic Encryption enables the evaluation of an arbitrary computable function over encrypted data. Among all such functions, particular interest goes for integer arithmetics. In this paper, we present a bundle of methods for fast arithmetic operations over encrypted data: addition/subtraction, multiplication, and some of their special cases. On top of that, we propose techniques for signum, maximum, and rounding. All methods are specifically tailored for computations with data encrypted with the TFHE scheme (Chillotti et al., Asiacrypt \u2716) and we mainly focus on parallelization of non-linear homomorphic operations, which are the most expensive ones. This way, evaluation times can be reduced significantly, provided that sufficient parallel resources are available. We implement all presented methods in the Parmesan Library and we provide an experimental evaluation. Compared to integer arithmetics of the Concrete Library, we achieve considerable speedups for all comparable operations. Major speedups are achieved for the multiplication of an encrypted integer by a cleartext one, where we employ special addition-subtraction chains, which save a vast amount of homomorphic operations

Cryptology ePrint Archive

Versatile Montgomery Multiplier Architectures

Author: Gaubatz Gunnar
Publication venue: Digital WPI
Publication date: 30/04/2002
Field of study

Several algorithms for Public Key Cryptography (PKC), such as RSA, Diffie-Hellman, and Elliptic Curve Cryptography, require modular multiplication of very large operands (sizes from 160 to 4096 bits) as their core arithmetic operation. To perform this operation reasonably fast, general purpose processors are not always the best choice. This is why specialized hardware, in the form of cryptographic co-processors, become more attractive. Based upon the analysis of recent publications on hardware design for modular multiplication, this M.S. thesis presents a new architecture that is scalable with respect to word size and pipelining depth. To our knowledge, this is the first time a word based algorithm for Montgomery\u27s method is realized using high-radix bit-parallel multipliers working with two different types of finite fields (unified architecture for GF(p) and GF(2n)). Previous approaches have relied mostly on bit serial multiplication in combination with massive pipelining, or Radix-8 multiplication with the limitation to a single type of finite field. Our approach is centered around the notion that the optimal delay in bit-parallel multipliers grows with logarithmic complexity with respect to the operand size n, O(log3/2 n), while the delay of bit serial implementations grows with linear complexity O(n). Our design has been implemented in VHDL, simulated and synthesized in 0.5μ CMOS technology. The synthesized net list has been verified in back-annotated timing simulations and analyzed in terms of performance and area consumption

DigitalCommons@WPI