677 research outputs found

    Energy-Efficient Neural Network Architectures

    Full text link
    Emerging systems for artificial intelligence (AI) are expected to rely on deep neural networks (DNNs) to achieve high accuracy for a broad variety of applications, including computer vision, robotics, and speech recognition. Due to the rapid growth of network size and depth, however, DNNs typically result in high computational costs and introduce considerable power and performance overheads. Dedicated chip architectures that implement DNNs with high energy efficiency are essential for adding intelligence to interactive edge devices, enabling them to complete increasingly sophisticated tasks by extending battery life. They are also vital for improving performance in cloud servers that support demanding AI computations. This dissertation focuses on architectures and circuit technologies for designing energy-efficient neural network accelerators. First, a deep-learning processor is presented for achieving ultra-low power operation. Using a heterogeneous architecture that includes a low-power always-on front-end and a selectively-enabled high-performance back-end, the processor dynamically adjusts computational resources at runtime to support conditional execution in neural networks and meet performance targets with increased energy efficiency. Featuring a reconfigurable datapath and a memory architecture optimized for energy efficiency, the processor supports multilevel dynamic activation of neural network segments, performing object detection tasks with 5.3x lower energy consumption in comparison with a static execution baseline. Fabricated in 40nm CMOS, the processor test-chip dissipates 0.23mW at 5.3 fps. It demonstrates energy scalability up to 28.6 TOPS/W and can be configured to run a variety of workloads, including severely power-constrained ones such as always-on monitoring in mobile applications. To further improve the energy efficiency of the proposed heterogeneous architecture, a new charge-recovery logic family, called zero-short-circuit current (ZSCC) logic, is proposed to decrease the power consumption of the always-on front-end. By relying on dedicated circuit topologies and a four-phase clocking scheme, ZSCC operates with significantly reduced short-circuit currents, realizing order-of-magnitude power savings at relatively low clock frequencies (on the order of a few MHz). The efficiency and applicability of ZSCC are demonstrated through an ANSI S1.11 1/3 octave filter bank chip for binaural hearing aids with two microphones per ear. Fabricated in a 65nm CMOS process, this charge-recovery chip consumes 13.8µW with a 1.75MHz clock frequency, achieving 9.7x power reduction per input in comparison with a 40nm monophonic single-input chip that represents the published state of the art. The ability of ZSCC to further increase the energy efficiency of the heterogeneous neural network architecture is demonstrated through the design and evaluation of a ZSCC-based front-end. Simulation results show 17x power reduction compared with a conventional static CMOS implementation of the same architecture.
    PhD, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/147614/1/hsiwu_1.pd
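    The conditional-execution idea described above can be summarized with a short sketch: a tiny always-on front-end scores each input, and the high-performance back-end is enabled only when that score crosses a threshold. This is a hedged illustration only; the model objects, method names, and threshold below are hypothetical and not taken from the dissertation.

def detect(frame, frontend, backend, wake_threshold=0.5):
    # Always-on stage: a small, low-power network produces a cheap relevance score.
    score = frontend.score(frame)       # hypothetical front-end model, runs on every frame
    if score < wake_threshold:
        return None                     # back-end stays disabled (power-gated) for this frame
    # Selectively-enabled stage: the expensive detector runs only when the front-end fires.
    return backend.detect(frame)        # hypothetical back-end object detector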

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size, and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low power consumption and high processing performance. The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors, and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % fewer cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 % to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 % lower compared to related audio interfaces for a frame size of 64 samples. The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features, with a datapath width varying from 24 bit to 64 bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for the SoC and 0.6 mW for the 64-bit processor. Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging, instruction scheduling and register allocation. The KAVUAKA processor architecture is compared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements
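    The cycle savings attributed to the complex-valued MAC unit can be made concrete with a small sketch of what a fused complex multiply-accumulate computes. This is purely illustrative Python, not the KAVUAKA datapath: a single complex MAC instruction would replace the four real multiplications and additions written out below, which is where the savings in FFT butterflies come from.

def complex_mac(acc, a, b):
    # acc, a, b are (real, imaginary) pairs; returns acc + a*b.
    ar, ai = a
    br, bi = b
    acc_r, acc_i = acc
    acc_r += ar * br - ai * bi   # real part of the product a*b
    acc_i += ar * bi + ai * br   # imaginary part of the product a*b
    return (acc_r, acc_i)

# Example: accumulate one twiddle-factor product, as in an FFT butterfly.
print(complex_mac((0.0, 0.0), (1.0, 2.0), (0.5, -0.25)))   # -> (1.0, 0.75)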

    05141 Abstracts Collection -- Power-aware Computing Systems

    Get PDF
    From 03.04.05 to 08.04.05, the Dagstuhl Seminar 05141 ``Power-aware Computing Systems'' was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research and ongoing work, and discussed open problems. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are collected in this paper. The first section describes the seminar topics and goals. Links to extended abstracts or full papers are provided, if available

    Gesture Based Home Automation for the Physically Disabled

    Get PDF
    Paralysis and motor impairments can greatly reduce the autonomy and quality of life of a patient while presenting a major recurring cost in home healthcare. Augmented with a non-invasive wearable sensor system and home-automation equipment, the patient can regain a level of autonomy at a fraction of the cost of home nurses. A system which utilizes sensor fusion, low-power digital components, and smartphone cellular capabilities can extend the usefulness of such a system, allowing greater adaptability for patients with various needs. This thesis develops such a system as a Bluetooth-enabled glove device which communicates with a remote web server to control smart devices within the home. Power consumption is treated as a major design consideration so that the system can operate with little maintenance, allowing for greater patient autonomy. The system is evaluated in terms of power consumption and accuracy to demonstrate its viability as a home accessibility tool
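    As a rough illustration of the control flow described above (glove gesture, remote web server, smart device), the sketch below maps a recognized gesture to a command and posts it to a server. It is a hedged example only; the endpoint URL, gesture names, and command payloads are hypothetical placeholders, not the thesis's protocol.

import requests

# Hypothetical gesture-to-command mapping; real gestures and devices would differ.
GESTURE_TO_COMMAND = {
    "swipe_left":  {"device": "lamp",       "action": "off"},
    "swipe_right": {"device": "lamp",       "action": "on"},
    "fist_hold":   {"device": "thermostat", "action": "raise"},
}

def send_gesture(gesture, server="https://example.invalid/api/command"):
    # Look up the smart-device command for the recognized gesture and post it.
    command = GESTURE_TO_COMMAND.get(gesture)
    if command is None:
        return False                       # unrecognized gesture: do nothing
    resp = requests.post(server, json=command, timeout=5)
    return resp.ok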

    An ultra-low voltage FFT processor using energy-aware techniques

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2004. Page 170 blank. Includes bibliographical references (p. 165-169).
    In a number of emerging applications such as wireless sensor networks, system lifetime depends on the energy efficiency of computation and communication. The key metric in such applications is the energy dissipated per function rather than traditional ones such as clock speed or silicon area. Hardware designs are shifting focus toward enabling energy-awareness, allowing the processor to be energy-efficient for a variety of operating scenarios. This is in contrast to conventional low-power design, which optimizes for the worst-case scenario. Here, three energy-quality scalable hooks are designed into a real-valued FFT processor: variable FFT length (N=128 to 1024 points), variable bit precision (8, 16 bit), and variable voltage supply with variable clock frequency (VDD=180mV to 0.9V, and f=164Hz to 6MHz). A variable-bit-precision and variable-FFT-length scalable FFT ASIC using an off-the-shelf standard-cell logic library and memory only scales down to 1V operation. Further energy savings are achieved through ultra-low voltage-supply operation. As performance requirements are relaxed, the operating voltage supply is scaled down, possibly even below the threshold voltage into the subthreshold region. When lower frequencies cause leakage energy dissipation to exceed the active energy dissipation, there is an optimal operating point for minimizing energy consumption. Logic and memory design techniques allowing ultra-low voltage operation are employed to study the optimal frequency/voltage operating point for the FFT. A full-custom implementation with circuit techniques optimized for deep voltage scaling into the subthreshold regime is fabricated using a standard CMOS 0.18µm logic process and functions down to 180mV. At the optimal operating point, where the voltage supply is 350mV, the FFT processor dissipates 155nJ/FFT. The custom FFT is 8x more energy-efficient than the ASIC implementation and 350x more energy-efficient than a low-power microprocessor implementation.
    by Alice Wang. Ph.D
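    The trade-off behind the optimal operating point mentioned above (active energy falls with the supply voltage while leakage energy grows as the clock slows) can be sketched with a toy model. The constants below are hypothetical placeholders chosen only to make the minimum visible, not values from the thesis.

import math

C_EFF  = 1e-9         # effective switched capacitance per operation (F), hypothetical
I_LEAK = 5e-3         # total leakage current of the design (A), hypothetical
V_TH   = 0.35         # threshold voltage (V), hypothetical
N_VT   = 1.5 * 0.026  # subthreshold slope factor times thermal voltage (V)
T_NOM  = 1e-9         # nominal cycle time at full supply (s), hypothetical

def energy_per_op(vdd):
    e_active = C_EFF * vdd ** 2                               # switching energy, ~Vdd^2
    t_cycle = T_NOM * math.exp(max(V_TH - vdd, 0.0) / N_VT)   # delay grows sharply below Vth
    e_leak = I_LEAK * vdd * t_cycle                           # leakage power integrated over one cycle
    return e_active + e_leak

# Sweep the supply from 0.18V to 0.90V and report the minimum-energy point of the toy model.
best_energy, best_vdd = min((energy_per_op(v / 100.0), v / 100.0) for v in range(18, 91))
print("minimum-energy supply (toy model): %.2f V, %.3g J/op" % (best_vdd, best_energy))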

    Exploiting Natural On-chip Redundancy for Energy Efficient Memory and Computing

    Get PDF
    Power density is currently the primary design constraint across most computing segments and the main performance-limiting factor. For years, industry has kept power density constant while increasing frequency and lowering transistor supply (Vdd) and threshold (Vth) voltages. However, Vth scaling has stopped because leakage current is exponentially related to it. Transistor count and integration density keep doubling every process generation (Moore’s Law), but the power budget caps the amount of hardware that can be active at the same time, leading to dark silicon. With each new generation, there are more resources available, but we cannot fully exploit their performance potential. In recent years, different research trends have explored how to cope with dark silicon and unlock the energy efficiency of the chips, including Near-Threshold voltage Computing (NTC) and approximate computing. NTC aggressively lowers Vdd to values near Vth. This allows a substantial reduction in power, as dynamic power scales quadratically with supply voltage. The resultant power reduction could be used to activate more chip resources and potentially achieve performance improvements. Unfortunately, Vdd scaling is limited by the tight functionality margins of on-chip SRAM transistors. When scaling Vdd down to values near threshold, manufacture-induced parameter variations affect the functionality of SRAM cells, which eventually become unreliable. A large number of emerging applications, on the other hand, feature an intrinsic error-resilience property, tolerating a certain amount of noise. In this context, approximate computing takes advantage of this observation and exploits the gap between the level of accuracy required by the application and the level of accuracy given by the computation, provided that reducing the accuracy translates into an energy gain. However, deciding which instructions and data and which techniques are best suited for approximation still poses a major challenge. This dissertation contributes in these two directions. First, it proposes a new approach to mitigate the impact of SRAM failures due to parameter variation for effective operation at ultra-low voltages. We identify two levels of natural on-chip redundancy: cache level and content level. The first arises because of the replication of blocks in multi-level cache hierarchies. We exploit this redundancy with a cache management policy that allocates blocks to entries taking into account the nature of the cache entry and the use pattern of the block. This policy obtains performance improvements of between 2% and 34% with respect to block disabling, a technique of similar complexity, while incurring no additional storage overhead. The latter (content-level redundancy) arises because of the redundancy of data in real-world applications. We exploit this redundancy by compressing cache blocks to fit them into partially functional cache entries. At the cost of a slight overhead increase, we can obtain performance within 2% of that obtained when the cache is built with fault-free cells, even if more than 90% of the cache entries have at least one faulty cell. Then, we analyze how the intrinsic noise tolerance of emerging applications can be exploited to design an approximate Instruction Set Architecture (ISA). Exploiting the ISA redundancy, we explore a set of techniques to approximate the execution of instructions across a set of emerging applications, pointing out the potential for reducing the complexity of the ISA and the trade-offs of the approach. In a proof-of-concept implementation, the ISA is shrunk in two dimensions: breadth (i.e., simplifying instructions) and depth (i.e., dropping instructions). This proof of concept shows that energy can be reduced by 20.6% on average at around 14.9% accuracy loss
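    The quadratic dependence of dynamic power on the supply voltage that motivates NTC is the standard CMOS switching-power relation, stated here for reference (with activity factor \alpha, switched capacitance C_L, and clock frequency f):

P_{\mathrm{dyn}} = \alpha \, C_L \, V_{dd}^{2} \, f, \qquad E_{\mathrm{dyn}} \propto C_L \, V_{dd}^{2}

    Lowering Vdd toward Vth therefore reduces the energy of each switching event quadratically, which is the power reduction NTC exploits, at the cost of slower and more variation-sensitive operation.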

    Low Power Decoding Circuits for Ultra Portable Devices

    Get PDF
    A wide range of existing and emerging battery-driven wireless devices do not necessarily demand high data rates. Rather, ultra low power, portability and low cost are the most desired characteristics. Examples of such applications are wireless sensor networks (WSN), body area networks (BAN), and a variety of medical implants and health-care aids. Being small, cheap and low power lets the individual transceiver nodes be used in abundance in remote places, where access for maintenance or recharging the battery is limited. In such scenarios, the lifetime of the battery, in most cases, determines the lifetime of the individual nodes. Therefore, energy consumption has to be so low that the nodes remain operational for an extended period of time, even up to a few years. It is known that using error correcting codes (ECC) in a wireless link can potentially help to reduce the transmit power considerably. However, the power consumption of the coding-decoding hardware itself is critical in an ultra low power transceiver node. The power and silicon area overhead of the coding-decoding circuitry needs to be kept to a minimum in the total energy and cost budget of the transceiver node. In this thesis, low power approaches to decoding circuits in the framework of the mentioned applications and use cases are investigated. The presented work is based on 65nm CMOS technology and is structured in four parts as follows: In the first part, the goals and objectives, background theory and fundamentals of the presented work are introduced. Also, the ECC block in coordination with its surrounding environment, a low power receiver chain, is presented. Designing and implementing an ultra low power and low cost wireless transceiver node introduces challenges that require special considerations at various levels of abstraction. Similarly, a competitive solution often emerges after a conclusive design space exploration. The proposed decoder circuits in the following parts are designed to be embedded in the low power receiver chain that is introduced in the first part. The second part explores the analog decoding method and its capability to be embedded in a compact and low power transceiver node. The analog decoding method was introduced theoretically over a decade ago and was followed by early proof-of-concept circuits that promised it to be a feasible low power solution. Still, with the increased popularity of low power sensor networks, it has not been clear how an analog decoding approach performs in terms of power, silicon area, data rate and integrity of calculations in recent technologies and for low data rates. The ultra low power budget, small size requirement and more relaxed demands on data rates suggest a decoding circuit with limited complexity. Therefore, the four-state (7,5) codes are considered for hardware implementation. Simulations to choose the critical design factors are presented. Consequently, to evaluate critical specifications of the decoding circuit, three versions of the analog decoding circuit with different transistor dimensions were fabricated. The measurement results reveal different trade-off possibilities as well as the potentials and limitations of the analog decoding approach for the target applications. Measurements are crucial, since the available computer-aided design (CAD) tools provide limited assistance and precision, given the amount of calculations and parameters that have to be included in the simulations.
    The largest analog decoding core (AD1) takes 0.104mm2 on silicon, and the other two (AD2 and AD3) take 0.035mm2 and 0.015mm2, respectively. Consequently, the coding gain in trade-off with silicon area and throughput is presented. The analog decoders operate with a 0.8V supply. The achieved coding gain is 2.3 dB at a bit error rate (BER) of 0.001, and an energy efficiency of 10 picojoules per bit (pJ/b) is reached at 2 Mbps. The third part of this thesis proposes an alternative low power digital decoding approach for the same codes. The desired compact and low power goal has been pursued by designing an equivalent digital decoding circuit that is fabricated in 65nm CMOS technology and operates in the low-voltage (near-threshold) region. The architecture of the design is optimized at the system and circuit levels to propose a competitive digital alternative. Similarly, critical specifications of the decoder in terms of power, area, data rate (speed) and integrity are reported according to the measurements. The digital implementation, with 0.11mm2 area, consumes minimum energy at a 0.32V supply, which gives 9 pJ/b energy efficiency at 125 kb/s and 2.9 dB coding gain at BER=0.001. The fourth and last part compares the proposed design alternatives, based on the fabricated chips and the results attained from the measurements, to determine the most suitable solution for the considered target applications. Advantages and disadvantages of both approaches are discussed. Possible extensions of this work are introduced as future work
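    For a rough sense of how the two fabricated decoders compare at their respective operating points, the following sketch multiplies the quoted energy-per-bit figures by the quoted throughputs to estimate average decoding power. The derived power numbers are back-of-the-envelope values computed from the figures above, not measurements reported in the thesis.

def decoder_power_uw(energy_pj_per_bit, throughput_bps):
    # average decoding power in microwatts = (energy per bit) * (bits per second)
    return energy_pj_per_bit * 1e-12 * throughput_bps * 1e6

# Operating points quoted above for the analog and digital decoders.
decoders = {
    "analog":  {"energy_pj_b": 10.0, "throughput_bps": 2_000_000, "coding_gain_db": 2.3},
    "digital": {"energy_pj_b": 9.0,  "throughput_bps": 125_000,   "coding_gain_db": 2.9},
}

for name, d in decoders.items():
    power = decoder_power_uw(d["energy_pj_b"], d["throughput_bps"])
    print(f"{name}: ~{power:.2f} uW at {d['throughput_bps'] / 1e3:.0f} kb/s, "
          f"{d['coding_gain_db']} dB coding gain at BER=0.001")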

    Low Power Digital Filter Implementation in FPGA

    Get PDF
    Digital filters suitable for hearing aid applications from a low-power perspective have been developed and implemented in an FPGA in this dissertation. Hearing aids are primarily meant for improving hearing and speech comprehension. Digital hearing aids score over their analog counterparts because they provide flexible gain in addition to facilitating feedback reduction and noise elimination. Recent advances in DSP and microelectronics have led to the development of superior digital hearing aids. Many researchers have investigated algorithms suitable for hearing aid applications that demand low noise, feedback cancellation, echo cancellation, etc.; however, the toughest challenge is the implementation. Furthermore, the additional constraints are power and area. The device must consume as little power as possible to support extended battery life and should be as small as possible for increased portability. In this thesis we investigate digital filter algorithms that are hardware-configurable from a low-power viewpoint. The suitability of a decimation filter for the hearing aid application is investigated. In this dissertation the decimation filter is implemented using the ‘Distributed Arithmetic’ approach. While designing this filter, it is observed that the comb-half-band FIR-FIR filter design uses less hardware than the comb-FIR-FIR filter design. The power consumption of the comb-half-band FIR-FIR design is also lower than that of the comb-FIR-FIR filter. This filter is implemented on a Virtex-II Pro board from Xilinx, and the resource estimator from System Generator is used to estimate the resources. However, ‘Distributed Arithmetic’ is highly serial in nature and its latency is high, so the power consumption of this type of filter implementation is not very low. We therefore proceed to an ‘Adaptive Hearing Aid’ using a Booth-Wallace tree multiplier. This algorithm is also implemented in the FPGA, and the power of the whole system is calculated using the Xilinx XPower analyser. It is observed that the hearing aid with the Booth-Wallace tree multiplier consumes about 25% less power than the one using a Booth multiplier, so the Booth-Wallace tree multiplier is the lower-power choice. The above two approaches are purely algorithmic. Next, we combine circuit-level VLSI design with the algorithmic approach for further possible reduction in power. A MAC-based FDF-FIR filter (algorithm) that uses dual-edge-triggered (DET) latches (circuit) is used for the hearing aid device. It is observed that the DET-based MAC FIR filter consumes about 41% less power than the traditional (single-edge-triggered, SET) one. The proposed low-power latch provides a power saving of up to 65% in the FIR filter. This technique consumes less power than previous approaches that use low-power techniques only at the algorithmic abstraction level. The DET-based MAC FIR filter is validated in real time and works correctly for various signals (speech, music, voice with music). The measured gain of the filter is 27 dB (maximum), which matches most hearing aid manufacturers’ specifications. Hence it can be concluded that the FDF FIR digital filter in conjunction with the low-power latch is a strong candidate for hearing aid application
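    The building block that both the Booth-Wallace multiplier and the DET-latch work target is the multiply-accumulate (MAC) loop of an FIR filter. The sketch below is only a software illustration of that loop; the coefficients and test signal are placeholders, not the dissertation's hearing-aid filter design.

def fir_mac(samples, coeffs):
    # Direct-form FIR filter: one MAC per tap for every output sample.
    out = []
    delay_line = [0.0] * len(coeffs)
    for x in samples:
        delay_line = [x] + delay_line[:-1]     # shift the new sample into the delay line
        acc = 0.0
        for tap, h in zip(delay_line, coeffs):
            acc += tap * h                     # the MAC whose power the multiplier and latch work targets
        out.append(acc)
    return out

# Example: a 4-tap moving-average filter applied to a short test signal.
print(fir_mac([1.0, 2.0, 3.0, 4.0, 5.0], [0.25, 0.25, 0.25, 0.25]))   # [0.25, 0.75, 1.5, 2.5, 3.5]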

    Center for Aeronautics and Space Information Sciences

    Get PDF
    This report summarizes the research done during 1991/92 under the Center for Aeronautics and Space Information Science (CASIS) program. The topics covered are computer architecture, networking, and neural nets