
    Implementation of a Combined OFDM-Demodulation and WCDMA-Equalization Module

    For a dual-mode baseband receiver for the OFDM Wireless LAN and WCDMA standards, integration of the demodulation and equalization tasks on a dedicated hardware module has been investigated. For OFDM demodulation, an FFT algorithm based on cascaded twiddle factor decomposition has been selected. This type of algorithm combines high spatial and temporal regularity in the FFT data-flow graphs with a minimal number of computations. A frequency-domain algorithm based on a circulant channel approximation has been selected for WCDMA equalization. It has good performance, low hardware complexity, and a low number of computations. Its main advantage is the reuse of the FFT kernel, which contributes to the integration of both tasks. The demodulation and equalization module has been described at the register-transfer level with the in-house developed Arx language. The core of the module is a pipelined radix-2³ butterfly combined with a complex multiplier and a complex divider. The module has an area of 0.447 mm² in 0.18 µm technology and a power consumption of 10.6 mW. The proposed module compares favorably with solutions reported in the literature.
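
    To make the equalizer idea concrete, here is a minimal numpy sketch (all names and parameters are illustrative, not the paper's implementation) of frequency-domain block equalization under a circulant channel approximation: each received block is taken to the frequency domain, scaled per bin by a one-tap equalizer, and transformed back, so the same FFT kernel can serve both the OFDM and the WCDMA paths.

        import numpy as np

        def circulant_fd_equalize(received, h, block_len=64, noise_var=1e-3):
            """Frequency-domain block equalization under a circulant channel
            approximation: within each block the channel is treated as
            circular, so equalization reduces to FFT -> per-bin scale -> IFFT."""
            H = np.fft.fft(h, block_len)               # channel frequency response
            # One-tap MMSE equalizer per frequency bin.
            E = np.conj(H) / (np.abs(H) ** 2 + noise_var)
            out = np.empty(0, dtype=complex)
            for start in range(0, len(received), block_len):
                block = received[start:start + block_len]
                if len(block) < block_len:
                    block = np.pad(block, (0, block_len - len(block)))
                # The same FFT kernel needed for OFDM demodulation is reused
                # here, which is the integration point the module exploits.
                out = np.concatenate([out, np.fft.ifft(np.fft.fft(block) * E)])
            return out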

    Flexible Multiple-Precision Fused Arithmetic Units for Efficient Deep Learning Computation

    Deep learning has achieved great success in recent years. In many application fields, such as computer vision, biomedical analysis, and natural language processing, deep learning can achieve performance that even exceeds human level. Behind this superior performance, however, is the expensive hardware cost required to implement deep learning operations, which are both computation intensive and memory intensive. Many research works in the literature have focused on improving the efficiency of deep learning operations. This thesis focuses on improving deep learning computation, and several efficient arithmetic unit architectures are proposed and optimized for it. The contents of this thesis can be divided into three parts: (1) the optimization of general-purpose arithmetic units for deep learning computation; (2) the design of deep-learning-specific arithmetic units; and (3) the optimization of deep learning computation using a 3D memory architecture.

    Deep learning models are usually trained on graphics processing units (GPUs), with computations done in single-precision floating point. However, recent works have shown that deep learning computation can be accomplished with low-precision numbers. Half-precision numbers are becoming more and more popular in deep learning computation due to their lower hardware cost compared to single precision. Conventional floating-point arithmetic units support single precision and beyond to achieve better accuracy; for deep learning computation, however, the computations are so intensive that low-precision computation is desired for better throughput. As the popularity of half precision rises, half-precision operations also need to be supported. Moreover, deep learning computation contains many dot-product operations, so support for mixed-precision dot-product operations can be explored in a multiple-precision architecture. In this thesis, a multiple-precision fused multiply-add (FMA) architecture is proposed. It supports half-, single-, double-, and quadruple-precision FMA operations, as well as 2-term mixed-precision dot-product operations. Compared to a conventional multiple-precision FMA architecture, the newly added half-precision support and mixed-precision dot product bring only minor resource overhead. The proposed FMA can be used as a general-purpose arithmetic unit and, due to its support for parallel half-precision computations and mixed-precision dot products, is especially suitable for deep learning computation.

    For deep-learning-specific computation units, more optimizations can be performed. First, a merged fixed-point and floating-point multiply-accumulate (MAC) unit is proposed. Since deep learning computation can be accomplished with low-precision number formats, support for high-precision floating-point operations can be eliminated. In this design, the half-precision floating-point format is supported to provide a large dynamic range for the small gradients that arise in deep learning training, while 8-bit fixed-point 2-term dot-product computation is supported for inference. Second, a flexible multiple-precision MAC unit architecture is proposed that supports both fixed-point and floating-point operations. In floating-point mode, the unit supports one 16-bit MAC operation or the sum of two 8-bit multiplications plus a 16-bit addend. To make the unit more versatile, the bit-widths of the exponent and mantissa can be flexibly exchanged; by setting the exponent bit-width to zero, the unit also supports fixed-point operations. In fixed-point mode, the unit supports one 16-bit MAC or the sum of two 8-bit multiplications plus a 16-bit addend, and can be further divided to support the sum of four 4-bit multiplications plus a 16-bit addend. At the lowest precision, the unit supports accumulation of eight 1-bit logic AND operations to enable binary neural networks. Finally, a MAC architecture based on the posit format, a promising numerical format for deep learning computation, is proposed to facilitate the use of posits in deep learning.

    In addition to the above-mentioned arithmetic units, an improved hybrid memory cube (HMC) architecture is proposed for weight-sharing deep neural network processing. By modifying the HMC instruction set and logic layer, the major part of the deep learning computation can be accomplished inside memory. The proposed design reduces memory bandwidth requirements and thus the energy consumed by memory data transfer.
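
    As an illustration of the 2-term mixed-precision dot product such an FMA supports, the following software sketch (an emulation under stated assumptions, not the thesis hardware) multiplies half-precision inputs and accumulates in single precision. Since fp16 products are exactly representable in fp32, only the final additions round.

        import numpy as np

        def mixed_precision_dot2(a0, b0, a1, b1, c):
            """2-term mixed-precision dot product: d = a0*b0 + a1*b1 + c,
            with half-precision multiplicands and single-precision
            accumulation. fp16 products (<= 22 significand bits) fit exactly
            in fp32, so only the two additions round; a fused hardware unit
            would round once, so this emulation is close but not
            bit-identical."""
            a0, b0, a1, b1 = (np.float16(x) for x in (a0, b0, a1, b1))
            p0 = np.float32(a0) * np.float32(b0)   # exact in fp32
            p1 = np.float32(a1) * np.float32(b1)   # exact in fp32
            return p0 + p1 + np.float32(c)

        print(mixed_precision_dot2(0.1, 0.2, 0.3, 0.4, 1.0))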

    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field over the last 15 years has established power as a first-class design concern. As a result, the computing systems community is forced to find alternative design approaches that facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has appeared in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing: it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.
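
    As a flavor of the program-level techniques such a survey classifies, the sketch below shows loop perforation, a classic software approximation: skip a fraction of loop iterations and treat the rest as representative, trading a bounded accuracy loss for a proportional reduction in work (the example is illustrative, not drawn from the article).

        def perforated_mean(samples, skip=2):
            """Loop perforation: execute only every `skip`-th iteration and
            treat the subset as representative, trading accuracy for a
            proportional reduction in work."""
            subset = samples[::skip]
            return sum(subset) / len(subset)

        data = list(range(1_000_000))
        print(perforated_mean(data, skip=1))   # exact mean: 499999.5
        print(perforated_mean(data, skip=4))   # ~4x less work, small error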

    Design of high-speed and low-power finite-word-length PID controllers.

    ASIC or FPGA implementation of a finite-word-length PID controller requires dual expertise, in control systems and in hardware design. In this paper, we focus only on the hardware side of the problem. We show how to design configurable fixed-point PIDs that satisfy applications requiring minimal power consumption, a high control rate, or both together. As the multiply operation is the engine of the PID, we experimented with three algorithms: Booth, modified Booth, and a new recursive multi-bit multiplication algorithm. The latter enables the construction of finely grained PID structures with bit-level and unit-time precision. Such a feature permits tailoring the PID to the desired performance and power budget. All PIDs are implemented at the register-transfer level (RTL) as technology-independent reusable IP cores. They are reconfigurable according to two compile-time constants: set-point word length and latency. To make the PID design easily reproducible, all necessary implementation details are provided and discussed.
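
    A minimal behavioral sketch of a fixed-point PID of the kind discussed, assuming a Q8 format and a velocity-form update with the sampling period folded into the gains (formats and gains are illustrative; the paper's RTL IP cores are not reproduced here):

        Q = 8  # Q8 fixed point: stored integer = value * 2**Q

        def to_fix(x):
            return int(round(x * (1 << Q)))

        def fix_mul(a, b):
            return (a * b) >> Q   # truncating multiply, as in low-cost hardware

        class FixedPointPID:
            """Velocity-form PID, sampling period folded into the gains:
            u[k] = u[k-1] + Kp*(e-e1) + Ki*e + Kd*(e - 2*e1 + e2)."""
            def __init__(self, kp, ki, kd):
                self.kp, self.ki, self.kd = to_fix(kp), to_fix(ki), to_fix(kd)
                self.e1 = self.e2 = 0   # previous two errors, fixed point
                self.u = 0              # controller output, fixed point

            def step(self, setpoint, measurement):
                e = to_fix(setpoint - measurement)
                self.u += (fix_mul(self.kp, e - self.e1)
                           + fix_mul(self.ki, e)
                           + fix_mul(self.kd, e - 2 * self.e1 + self.e2))
                self.e2, self.e1 = self.e1, e
                return self.u / (1 << Q)   # back to float for inspection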

    Continuous Integration for Fast SoC Algorithm Development

    Digital systems have become advanced and hard to design and optimize due to ever-growing technology. Integrated circuits (ICs) have become more complicated due to the complex computations in the latest technologies. Communication systems such as mobile networks have evolved with the advancement of technology over the years and become a part of our daily lives. Hence, the need for efficient, reusable, and automated processes for System-on-a-Chip (SoC) development has increased. The purpose of this thesis is to study and evaluate currently used SoC development processes and to present guidelines on how these processes can be streamlined. The thesis starts by evaluating currently used SoC development flows and their advantages and disadvantages. One important aspect is to identify steps which cause duplication of work and unnecessary idle times in SoC development teams. A study is conducted, and input from SoC development experts is gathered, in order to optimize SoC flows and the use of a Continuous Integration (CI) system. An algorithm model is implemented that can be used in multiple stages of SoC development at adequate complexity and is "easy enough" to be used by a person not mastering the topic. The outcome of the thesis is a proposal for a CI system in SoC development that accelerates the speed and reliability of implementing algorithms in RTL code and, finally, in a product. A CI system tool is also implemented to automate testing of the model design so that it remains up to date.
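
    A sketch of the kind of automated check such a CI system could run, comparing RTL simulation output against the golden algorithm model on every commit (the file format, tolerance, and interfaces are assumptions, not the thesis tooling):

        import numpy as np

        def check_rtl_against_model(rtl_dump, model, stimulus, tol=1e-3):
            """CI regression step: replay the stimulus through the golden
            algorithm model and compare with the values dumped by the RTL
            testbench. Returns True when they agree within `tol`."""
            rtl = np.loadtxt(rtl_dump)                  # one output per line
            ref = np.array([model(x) for x in stimulus])
            max_err = np.max(np.abs(rtl - ref))
            print(f"max |rtl - model| = {max_err:.3g}")
            return max_err <= tol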

    ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION

    Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry-standard precisions embody accuracy and performance compromises for general-purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple-precision ALUs focused on predefined, static precisions; little prior work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic-precision ALU architectures for both fixed point and floating point to enable better performance, energy efficiency, and numeric accuracy. The new architectures enable dynamically defined precision, including support for vectorization, and they prevent the performance and energy loss incurred by applying unnecessarily high precision to computations, as often happens with statically defined standard precisions. The new ALU architectures support different precisions through configurable sub-blocks, and this dissertation includes demonstration implementations of floating-point adder, multiplier, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic-precision ALU is nearly as fast as traditional ALU approaches, although it is nearly twice as large.
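
    The sub-block idea can be modeled behaviorally. The sketch below (illustrative only, not the dissertation's circuits) builds a wide adder from 4-bit sub-blocks and cuts the carry chain at run time, so the same hardware acts as one wide adder or several narrow vectorized ones.

        BLOCK = 4  # sub-block width in bits

        def subblock_add(a, b, width, lane_bits):
            """Add `a + b` over `width` bits built from 4-bit sub-blocks.
            `lane_bits` sets the dynamic precision: the carry chain is broken
            at every lane boundary, so one 16-bit adder can act as, e.g.,
            four independent 4-bit adders or two 8-bit adders."""
            assert width % BLOCK == 0 and lane_bits % BLOCK == 0
            mask = (1 << BLOCK) - 1
            carry, result = 0, 0
            for pos in range(0, width, BLOCK):
                if pos % lane_bits == 0:
                    carry = 0  # cut the carry chain at a lane boundary
                s = ((a >> pos) & mask) + ((b >> pos) & mask) + carry
                result |= (s & mask) << pos
                carry = s >> BLOCK
            return result

        # One 16-bit add vs. two independent 8-bit adds on the same hardware:
        print(hex(subblock_add(0x00FF, 0x0001, 16, 16)))  # 0x100: carry crosses
        print(hex(subblock_add(0x00FF, 0x0001, 16, 8)))   # 0x0: carry cut at bit 8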

    Low Latency Prefix Accumulation Driven Compound MAC Unit for Efficient FIR Filter Implementation

    This article presents a hierarchical, single compound-adder-based MAC with assertion-based error correction for speculation variations in the prefix addition of an FIR filter design. VLSI implementation results for the approximate prefix adder show significant delay and complexity reductions, at the cost of a latency penalty when speculation fails during carry propagation, which is the main reason preventing the use of speculation in parallel-prefix adders in DSP applications. The speculative adder, based on the Han-Carlson parallel-prefix adder structure, achieves a better reduction in latency. A structured and efficient shift-add technique is introduced, and latency reduction is explored by incorporating approximation in the addition. The improvements in latency and performance achieved by the proposed MAC unit are shown through FPGA synthesis. Results show that the proposed method outpaces previously proposed MAC designs based on multiplication methods in attaining high speed.
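
    For readers unfamiliar with parallel-prefix addition, the following behavioral sketch shows the generate/propagate prefix operator that Han-Carlson adders build on, using a dense Kogge-Stone-like network for simplicity; the article's speculative and approximate variants prune such a network (the code is illustrative only).

        def prefix_add(a, b, width=16):
            """Parallel-prefix addition: compute per-bit generate/propagate,
            combine them with the associative prefix operator
            (g, p) o (g', p') = (g | (p & g'), p & p'),
            then form sum bits from the resulting carries. Han-Carlson uses a
            sparser network than this dense sketch, but the operator and the
            dataflow are the same."""
            g = [(a >> i) & (b >> i) & 1 for i in range(width)]
            p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
            G, P = g[:], p[:]
            d = 1
            while d < width:
                # Downward sweep so G[i-d], P[i-d] still hold the previous level.
                for i in range(width - 1, d - 1, -1):
                    G[i] = G[i] | (P[i] & G[i - d])
                    P[i] = P[i] & P[i - d]
                d <<= 1
            carries = [0] + G[:-1]           # carry into bit i is G[i-1]
            s = 0
            for i in range(width):
                s |= (p[i] ^ carries[i]) << i
            return s

        assert prefix_add(0xABCD, 0x1234) == (0xABCD + 0x1234) & 0xFFFF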