482 research outputs found

    Accuracy-Guaranteed Fixed-Point Optimization in Hardware Synthesis and Processor Customization

    Get PDF
    RÉSUMÉ De nos jours, le calcul avec des nombres fractionnaires est essentiel dans une vaste gamme d’applications de traitement de signal et d’image. Pour le calcul numĂ©rique, un nombre fractionnaire peut ĂȘtre reprĂ©sentĂ© Ă  l’aide de l’arithmĂ©tique en virgule fixe ou en virgule flottante. L’arithmĂ©tique en virgule fixe est largement considĂ©rĂ©e prĂ©fĂ©rable Ă  celle en virgule flottante pour les architectures matĂ©rielles dĂ©diĂ©es en raison de sa plus faible complexitĂ© d’implĂ©mentation. Dans la mise en Ɠuvre du matĂ©riel, la largeur de mot attribuĂ©e Ă  diffĂ©rents signaux a un impact significatif sur des mĂ©triques telles que les ressources (transistors), la vitesse et la consommation d'Ă©nergie. L'optimisation de longueur de mot (WLO) en virgule fixe est un domaine de recherche bien connu qui vise Ă  optimiser les chemins de donnĂ©es par l'ajustement des longueurs de mots attribuĂ©es aux signaux. Un nombre en virgule fixe est composĂ© d’une partie entiĂšre et d’une partie fractionnaire. Il y a une limite infĂ©rieure au nombre de bits allouĂ©s Ă  la partie entiĂšre, de façon Ă  prĂ©venir les dĂ©bordements pour chaque signal. Cette limite dĂ©pend de la gamme de valeurs que peut prendre le signal. Le nombre de bits de la partie fractionnaire, quant Ă  lui, dĂ©termine la taille de l'erreur de prĂ©cision finie qui est introduite dans les calculs. Il existe un compromis entre la prĂ©cision et l'efficacitĂ© du matĂ©riel dans la sĂ©lection du nombre de bits de la partie fractionnaire. Le processus d'attribution du nombre de bits de la partie fractionnaire comporte deux procĂ©dures importantes: la modĂ©lisation de l'erreur de quantification et la sĂ©lection de la taille de la partie fractionnaire. Les travaux existants sur la WLO ont portĂ© sur des circuits spĂ©cialisĂ©s comme plate-forme cible. Dans cette thĂšse, nous introduisons de nouvelles mĂ©thodologies, techniques et algorithmes pour amĂ©liorer l’implĂ©mentation de calculs en virgule fixe dans des circuits et processeurs spĂ©cialisĂ©s. La thĂšse propose une approche amĂ©liorĂ©e de modĂ©lisation d’erreur, basĂ©e sur l'arithmĂ©tique affine, qui aborde certains problĂšmes des mĂ©thodes existantes et amĂ©liore leur prĂ©cision. La thĂšse introduit Ă©galement une technique d'accĂ©lĂ©ration et deux algorithmes semi-analytiques pour la sĂ©lection de la largeur de la partie fractionnaire pour la conception de circuits spĂ©cialisĂ©s. Alors que le premier algorithme suit une stratĂ©gie de recherche progressive, le second utilise une mĂ©thode de recherche en forme d'arbre pour l'optimisation de la largeur fractionnaire. Les algorithmes offrent deux options de compromis entre la complexitĂ© de calcul et le coĂ»t rĂ©sultant. Le premier algorithme a une complexitĂ© polynomiale et obtient des rĂ©sultats comparables avec des approches heuristiques existantes. Le second algorithme a une complexitĂ© exponentielle, mais il donne des rĂ©sultats quasi-optimaux par rapport Ă  une recherche exhaustive. Cette thĂšse propose Ă©galement une mĂ©thode pour combiner l'optimisation de la longueur des mots dans un contexte de conception de processeurs configurables. La largeur et la profondeur des blocs de registres et l'architecture des unitĂ©s fonctionnelles sont les principaux objectifs ciblĂ©s par cette optimisation. Un nouvel algorithme d'optimisation a Ă©tĂ© dĂ©veloppĂ© pour trouver la meilleure combinaison de longueurs de mots et d'autres paramĂštres configurables dans la mĂ©thode proposĂ©e. Les exigences de prĂ©cision, dĂ©finies comme l'erreur pire cas, doivent ĂȘtre respectĂ©es par toute solution. Pour faciliter l'Ă©valuation et la mise en Ɠuvre des solutions retenues, un nouvel environnement de conception de processeur a Ă©galement Ă©tĂ© dĂ©veloppĂ©. Cet environnement, qui est appelĂ© PolyCuSP, supporte une large gamme de paramĂštres, y compris ceux qui sont nĂ©cessaires pour Ă©valuer les solutions proposĂ©es par l'algorithme d'optimisation. L’environnement PolyCuSP soutient l’exploration rapide de l'espace de solution et la capacitĂ© de modĂ©liser diffĂ©rents jeux d'instructions pour permettre des comparaisons efficaces.----------ABSTRACT Fixed-point arithmetic is broadly preferred to floating-point in hardware development due to the reduced hardware complexity of fixed-point circuits. In hardware implementation, the bitwidth allocated to the data elements has significant impact on efficiency metrics for the circuits including area usage, speed and power consumption. Fixed-point word-length optimization (WLO) is a well-known research area. It aims to optimize fixed-point computational circuits through the adjustment of the allocated bitwidths of their internal and output signals. A fixed-point number is composed of an integer part and a fractional part. There is a minimum number of bits for the integer part that guarantees overflow and underflow avoidance in each signal. This value depends on the range of values that the signal may take. The fractional word-length determines the amount of finite-precision error that is introduced in the computations. There is a trade-off between accuracy and hardware cost in fractional word-length selection. The process of allocating the fractional word-length requires two important procedures: finite-precision error modeling and fractional word-length selection. Existing works on WLO have focused on hardwired circuits as the target implementation platform. In this thesis, we introduce new methodologies, techniques and algorithms to improve the hardware realization of fixed-point computations in hardwired circuits and customizable processors. The thesis proposes an enhanced error modeling approach based on affine arithmetic that addresses some shortcomings of the existing methods and improves their accuracy. The thesis also introduces an acceleration technique and two semi-analytical fractional bitwidth selection algorithms for WLO in hardwired circuit design. While the first algorithm follows a progressive search strategy, the second one uses a tree-shaped search method for fractional width optimization. The algorithms offer two different time-complexity/cost efficiency trade-off options. The first algorithm has polynomial complexity and achieves comparable results with existing heuristic approaches. The second algorithm has exponential complexity but achieves near-optimal results compared to an exhaustive search. The thesis further proposes a method to combine word-length optimization with application-specific processor customization. The supported datatype word-length, the size of register-files and the architecture of the functional units are the main target objectives to be optimized. A new optimization algorithm is developed to find the best combination of word-length and other customizable parameters in the proposed method. Accuracy requirements, defined as the worst-case error bound, are the key consideration that must be met by any solution. To facilitate evaluation and implementation of the selected solutions, a new processor design environment was developed. This environment, which is called PolyCuSP, supports necessary customization flexibility to realize and evaluate the solutions given by the optimization algorithm. PolyCuSP supports rapid design space exploration and capability to model different instruction-set architectures to enable effective compari

    Optimization of DSSS Receivers Using Hardware-in-the-Loop Simulations

    Get PDF
    Over the years, there has been significant interest in defining a hardware abstraction layer to facilitate code reuse in software defined radio (SDR) applications. Designers are looking for a way to enable application software to specify a waveform, configure the platform, and control digital signal processing (DSP) functions in a hardware platform in a way that insulates it from the details of realization. This thesis presents a tool-based methodolgy for developing and optimizing a Direct Sequence Spread Spectrum (DSSS) transceiver deployed in custom hardware like Field Programmble Gate Arrays (FPGAs). The system model consists of a tranmitter which employs a quadrature phase shift keying (QPSK) modulation scheme, an additive white Gaussian noise (AWGN) channel, and a receiver whose main parts consist of an analog-to-digital converter (ADC), digital down converter (DDC), image rejection low-pass filter (LPF), carrier phase locked loop (PLL), tracking locked loop, down-sampler, spread spectrum correlators, and rectangular-to-polar converter. The design methodology is based on a new programming model for FPGAs developed in the industry by Xilinx Inc. The Xilinx System Generator for DSP software tool provides design portability and streamlines system development by enabling engineers to create and validate a system model in Xilinx FPGAs. By providing hierarchical modeling and automatic HDL code generation for programmable devices, designs can be easily verified through hardware-in-the-loop (HIL) simulations. HIL provides a significant increase in simulation speed which allows optimization of the receiver design with respect to the datapath size for different functional parts of the receiver. The parameterized datapath points used in the simulation are ADC resolution, DDC datapath size, LPF datapath size, correlator height, correlator datapath size, and rectangular-to-polar datapath size. These parameters are changed in the software enviornment and tested for bit error rate (BER) performance through real-time hardware simualtions. The final result presents a system design with minimum harware area occupancy relative to an acceptable BER degradation

    Timing-Error Tolerance Techniques for Low-Power DSP: Filters and Transforms

    Get PDF
    Low-power Digital Signal Processing (DSP) circuits are critical to commercial System-on-Chip design for battery powered devices. Dynamic Voltage Scaling (DVS) of digital circuits can reclaim worst-case supply voltage margins for delay variation, reducing power consumption. However, removing static margins without compromising robustness is tremendously challenging, especially in an era of escalating reliability concerns due to continued process scaling. The Razor DVS scheme addresses these concerns, by ensuring robustness using explicit timing-error detection and correction circuits. Nonetheless, the design of low-complexity and low-power error correction is often challenging. In this thesis, the Razor framework is applied to fixed-precision DSP filters and transforms. The inherent error tolerance of many DSP algorithms is exploited to achieve very low-overhead error correction. Novel error correction schemes for DSP datapaths are proposed, with very low-overhead circuit realisations. Two new approximate error correction approaches are proposed. The first is based on an adapted sum-of-products form that prevents errors in intermediate results reaching the output, while the second approach forces errors to occur only in less significant bits of each result by shaping the critical path distribution. A third approach is described that achieves exact error correction using time borrowing techniques on critical paths. Unlike previously published approaches, all three proposed are suitable for high clock frequency implementations, as demonstrated with fully placed and routed FIR, FFT and DCT implementations in 90nm and 32nm CMOS. Design issues and theoretical modelling are presented for each approach, along with SPICE simulation results demonstrating power savings of 21 – 29%. Finally, the design of a baseband transmitter in 32nm CMOS for the Spectrally Efficient FDM (SEFDM) system is presented. SEFDM systems offer bandwidth savings compared to Orthogonal FDM (OFDM), at the cost of increased complexity and power consumption, which is quantified with the first VLSI architecture

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

    Doctor of Philosophy

    Get PDF
    dissertationAbstraction plays an important role in digital design, analysis, and verification, as it allows for the refinement of functions through different levels of conceptualization. This dissertation introduces a new method to compute a symbolic, canonical, word-level abstraction of the function implemented by a combinational logic circuit. This abstraction provides a representation of the function as a polynomial Z = F(A) over the Galois field F2k , expressed over the k-bit input to the circuit, A. This representation is easily utilized for formal verification (equivalence checking) of combinational circuits. The approach to abstraction is based upon concepts from commutative algebra and algebraic geometry, notably the Grobner basis theory. It is shown that the polynomial F(A) can be derived by computing a Grobner basis of the polynomials corresponding to the circuit, using a specific elimination term order based on the circuits topology. However, computing Grobner bases using elimination term orders is infeasible for large circuits. To overcome these limitations, this work introduces an efficient symbolic computation to derive the word-level polynomial. The presented algorithms exploit i) the structure of the circuit, ii) the properties of Grobner bases, iii) characteristics of Galois fields F2k , and iv) modern algorithms from symbolic computation. A custom abstraction tool is designed to efficiently implement the abstraction procedure. While the concept is applicable to any arbitrary combinational logic circuit, it is particularly powerful in verification and equivalence checking of hierarchical, custom designed and structurally dissimilar Galois field arithmetic circuits. In most applications, the field size and the datapath size k in the circuits is very large, up to 1024 bits. The proposed abstraction procedure can exploit the hierarchy of the given Galois field arithmetic circuits. Our experiments show that, using this approach, our tool can abstract and verify Galois field arithmetic circuits up to 1024 bits in size. Contemporary techniques fail to verify these types of circuits beyond 163 bits and cannot abstract a canonical representation beyond 32 bits

    Energy Efficient Hardware Design for Securing the Internet-of-Things

    Full text link
    The Internet of Things (IoT) is a rapidly growing field that holds potential to transform our everyday lives by placing tiny devices and sensors everywhere. The ubiquity and scale of IoT devices require them to be extremely energy efficient. Given the physical exposure to malicious agents, security is a critical challenge within the constrained resources. This dissertation presents energy-efficient hardware designs for IoT security. First, this dissertation presents a lightweight Advanced Encryption Standard (AES) accelerator design. By analyzing the algorithm, a novel method to manipulate two internal steps to eliminate storage registers and replace flip-flops with latches to save area is discovered. The proposed AES accelerator achieves state-of-art area and energy efficiency. Second, the inflexibility and high Non-Recurring Engineering (NRE) costs of Application-Specific-Integrated-Circuits (ASICs) motivate a more flexible solution. This dissertation presents a reconfigurable cryptographic processor, called Recryptor, which achieves performance and energy improvements for a wide range of security algorithms across public key/secret key cryptography and hash functions. The proposed design employs circuit techniques in-memory and near-memory computing and is more resilient to power analysis attack. In addition, a simulator for in-memory computation is proposed. It is of high cost to design and evaluate new-architecture like in-memory computing in Register-transfer level (RTL). A C-based simulator is designed to enable fast design space exploration and large workload simulations. Elliptic curve arithmetic and Galois counter mode are evaluated in this work. Lastly, an error resilient register circuit, called iRazor, is designed to tolerate unpredictable variations in manufacturing process operating temperature and voltage of VLSI systems. When integrated into an ARM processor, this adaptive approach outperforms competing industrial techniques such as frequency binning and canary circuits in performance and energy.PHDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147546/1/zhyiqun_1.pd

    Energy efficient hardware acceleration of multimedia processing tools

    Get PDF
    The world of mobile devices is experiencing an ongoing trend of feature enhancement and generalpurpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks Based on the survey that this thesis presents on modem video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work m this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high level techniques such as redundant computation elimination, parallelism and low switching computation structures. Both architectures compare favourably against the relevant pnor art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation - namely, both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search space reductions .Results show an improvement on state of the art algorithms with future potential for even greater savings
    • 

    corecore