KAVUAKA: a low-power application-specific processor architecture for digital hearing aids
The power consumption of digital hearing aids is severely restricted by their small physical size, and the hardware resources available for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low power consumption and high processing performance.

The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized for state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors, and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % fewer cycles to compute a 128-point fast Fourier transform (FFT) than related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 % to 17 % with isolation and bypass techniques. The hardware-induced audio latency is 34 % lower than that of related audio interfaces for a frame size of 64 samples.

The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) in a 40 nm low-power technology. The die size is 3.6 mm². Each processor and co-processor contains individual customizations and hardware features, with datapath widths ranging from 24 bits to 64 bits. The core area of the 64-bit processor configuration is 0.134 mm². The processors are organized in two clusters that share memory, an audio interface, co-processors, and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for the SoC and 0.6 mW for the 64-bit processor.

Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging, instruction scheduling, and register allocation. The KAVUAKA processor architecture is compared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements.
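The complex-valued MAC unit is the contribution behind the FFT cycle-count reduction quoted above: an FFT butterfly is essentially a complex multiply against a twiddle factor followed by an accumulate. The minimal Python sketch below illustrates the operation such a unit fuses; the floating-point arithmetic and function names are illustrative assumptions, not the KAVUAKA fixed-point datapath.

```python
import math

def complex_mac(acc_re, acc_im, a_re, a_im, b_re, b_im):
    """acc += a * b for complex operands: four real multiplies plus the
    cross-term additions that a single hardware complex MAC can fuse."""
    acc_re += a_re * b_re - a_im * b_im
    acc_im += a_re * b_im + a_im * b_re
    return acc_re, acc_im

# One radix-2 FFT butterfly multiplies a sample by a twiddle factor
# W = exp(-j*pi/4) and accumulates it into the partial result.
re, im = complex_mac(0.0, 0.0, 1.0, 0.5,
                     math.cos(math.pi / 4), -math.sin(math.pi / 4))
```

A unit that issues these four multiplies and the additions as one instruction is what saves cycles relative to a DSP that must decompose the complex product into scalar operations.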
Complex Block Floating-Point Format with Box Encoding For Wordlength Reduction in Communication Systems
We propose a new complex block floating-point format to reduce implementation
complexity. The new format achieves wordlength reduction by sharing an exponent
across the block of samples, and uses box encoding for the shared exponent to
reduce quantization error. Arithmetic operations are performed on blocks of
samples at a time, which can also reduce implementation complexity. For a case
study of a baseband quadrature amplitude modulation (QAM) transmitter and
receiver, we quantify the tradeoffs in signal quality vs. implementation
complexity using the new approach to represent IQ samples. Signal quality is
measured using error vector magnitude (EVM) in the receiver, and implementation
complexity is measured in terms of arithmetic complexity as well as memory
allocation and memory input/output rates. The primary contributions of this
paper are (1) a complex block floating-point format with box encoding of the
shared exponent to reduce quantization error, (2) arithmetic operations using
the new complex block floating-point format, and (3) a QAM transceiver case
study to quantify signal quality vs. implementation complexity tradeoffs using
the new format and arithmetic operations.

Comment: 6 pages, 9 figures, submitted to Asilomar Conference on Signals, Systems, and Computers 201
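To make the shared-exponent idea concrete, here is a hedged Python sketch of block encoding and decoding. One exponent is chosen for the whole block; the "box" reading below (deriving the exponent from the largest real or imaginary component, i.e. a bounding box around the samples, rather than from the largest magnitude) is an interpretation of the abstract, and the mantissa width, rounding, and clipping policy are assumptions rather than the paper's exact rules.

```python
import numpy as np

def encode_block(samples, mantissa_bits=8):
    """Quantize a block of complex samples to integer mantissas that share
    one exponent, using the convention value ~= mantissa * 2**exp."""
    peak = max(np.abs(samples.real).max(), np.abs(samples.imag).max())
    exp = int(np.ceil(np.log2(peak))) - (mantissa_bits - 1) if peak > 0 else 0
    lim = 2 ** (mantissa_bits - 1)
    re = np.clip(np.round(samples.real / 2.0 ** exp), -lim, lim - 1).astype(np.int32)
    im = np.clip(np.round(samples.imag / 2.0 ** exp), -lim, lim - 1).astype(np.int32)
    return re, im, exp

def decode_block(re, im, exp):
    """Reconstruct complex values from shared-exponent mantissas."""
    return (re + 1j * im) * 2.0 ** exp
```

The wordlength saving comes from storing the exponent once per block instead of once per sample, at the cost of extra quantization error for samples much smaller than the block peak.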
Review of Parallel Decoding of Space-time Block Codes toward 4G Wireless and Mobile Communications
This paper presents a review of recent developments in the area of STBC decoding, particularly parallel decoding of full-rate full-diversity STBCs toward real-time 4G wireless communications. After reviewing some parallel STBC decoding techniques and presenting one of the most promising types of parallel processors suitable for 4G software-defined radio (SDR), the SIMD processor, the paper shows that parallel decoding of the Golden Code on the ClearSpeed CSX700 SIMD processor achieves a speedup of up to 30 times. The paper highlights the potential to achieve real-time decoding of high-rate STBCs with the use of robust parallel processors.
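The Golden Code maximum-likelihood decoder that the review parallelizes is intricate, but the data parallelism it exploits can be illustrated with the much simpler classic Alamouti 2x1 STBC. The hedged sketch below vectorizes Alamouti combining across many received blocks at once, the same blocks-in-lanes pattern a SIMD machine such as the CSX700 uses; it is explicitly not the Golden Code decoder itself.

```python
import numpy as np

def alamouti_combine(r1, r2, h1, h2):
    """Classic Alamouti 2x1 combining. r1, r2 are the receptions in two
    symbol periods and h1, h2 the channel gains; all arguments may be
    arrays so that many independent blocks are decoded in parallel,
    mimicking SIMD lanes."""
    gain = np.abs(h1) ** 2 + np.abs(h2) ** 2
    s1 = (np.conj(h1) * r1 + h2 * np.conj(r2)) / gain
    s2 = (np.conj(h2) * r1 - h1 * np.conj(r2)) / gain
    return s1, s2
```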
Energy-Efficient Neural Network Architectures
Emerging systems for artificial intelligence (AI) are expected to rely on deep neural networks (DNNs) to achieve high accuracy for a broad variety of applications, including computer vision, robotics, and speech recognition. Due to the rapid growth of network size and depth, however, DNNs typically incur high computational costs and introduce considerable power and performance overheads. Dedicated chip architectures that implement DNNs with high energy efficiency are essential for adding intelligence to interactive edge devices, enabling them to complete increasingly sophisticated tasks while extending battery life. They are also vital for improving performance in cloud servers that support demanding AI computations.
This dissertation focuses on architectures and circuit technologies for designing energy-efficient neural network accelerators. First, a deep-learning processor is presented for achieving ultra-low power operation. Using a heterogeneous architecture that includes a low-power always-on front-end and a selectively-enabled high-performance back-end, the processor dynamically adjusts computational resources at runtime to support conditional execution in neural networks and meet performance targets with increased energy efficiency. Featuring a reconfigurable datapath and a memory architecture optimized for energy efficiency, the processor supports multilevel dynamic activation of neural network segments, performing object detection tasks with 5.3x lower energy consumption in comparison with a static execution baseline. Fabricated in 40 nm CMOS, the processor test-chip dissipates 0.23 mW at 5.3 fps. It demonstrates energy scalability up to 28.6 TOPS/W and can be configured to run a variety of workloads, including severely power-constrained ones such as always-on monitoring in mobile applications.
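The conditional-execution idea in the paragraph above reduces to a simple control pattern: a cheap always-on network screens every input, and the expensive back-end runs only when the screen fires. The sketch below is an illustrative assumption about that control flow (names and the threshold rule are invented for clarity), not the test-chip's actual policy.

```python
def heterogeneous_inference(frame, frontend, backend, threshold=0.5):
    """Run the low-power always-on front-end on every frame; power up the
    high-performance back-end only when the front-end's confidence score
    crosses the threshold, so most frames cost only the cheap pass."""
    score = frontend(frame)      # small always-on network
    if score < threshold:
        return None              # back-end stays power-gated this frame
    return backend(frame)        # full detection on candidate frames only
```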
To further improve the energy efficiency of the proposed heterogeneous architecture, a new charge-recovery logic family, called zero-short-circuit current (ZSCC) logic, is proposed to decrease the power consumption of the always-on front-end. By relying on dedicated circuit topologies and a four-phase clocking scheme, ZSCC operates with significantly reduced short-circuit currents, realizing order-of-magnitude power savings at relatively low clock frequencies (on the order of a few MHz). The efficiency and applicability of ZSCC are demonstrated through an ANSI S1.11 1/3-octave filter bank chip for binaural hearing aids with two microphones per ear. Fabricated in a 65 nm CMOS process, this charge-recovery chip consumes 13.8 µW at a 1.75 MHz clock frequency, achieving a 9.7x power reduction per input in comparison with a 40 nm monophonic single-input chip that represents the published state of the art. The ability of ZSCC to further increase the energy efficiency of the heterogeneous neural network architecture is demonstrated through the design and evaluation of a ZSCC-based front-end. Simulation results show a 17x power reduction compared with a conventional static CMOS implementation of the same architecture.

PhD, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147614/1/hsiwu_1.pd
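For readers unfamiliar with the ANSI S1.11 filter bank mentioned above, its 1/3-octave midband frequencies follow a simple rule: with the base-2 octave ratio, band m is centered at 1000 * 2**((m - 30) / 3) Hz, so band 30 sits at 1 kHz. The sketch below only illustrates that band layout; which bands the chip actually implements is not stated in this abstract.

```python
def third_octave_centers(bands=range(22, 44)):
    """Nominal ANSI S1.11 1/3-octave midband frequencies (base-2
    definition), with band 30 anchored at 1 kHz."""
    return {m: 1000.0 * 2.0 ** ((m - 30) / 3.0) for m in bands}

# Example: band 22 ~= 157 Hz, band 30 = 1000 Hz, band 43 ~= 20159 Hz.
```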
The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview
In today's data-centric world, where data fuels numerous application domains,
with machine learning at the forefront, handling the enormous volume of data
efficiently in terms of time and energy presents a formidable challenge.
Conventional computing systems and accelerators are continually being pushed to
their limits to stay competitive. In this context, computing near-memory (CNM)
and computing-in-memory (CIM) have emerged as potentially game-changing
paradigms. This survey introduces the basics of CNM and CIM architectures,
including their underlying technologies and working principles. We focus
particularly on CIM and CNM architectures that have either been prototyped or
commercialized. While surveying the evolving CIM and CNM landscape in academia
and industry, we discuss the potential benefits in terms of performance,
energy, and cost, along with the challenges associated with these cutting-edge
computing paradigms.
Complex block floating-point format with box encoding in communication systems
This research project develops an efficient numeric digital representation for communication systems design. A complex block floating-point format with box encoding is proposed to encode an array of complex numbers; it offers better numeric resolution than its IEEE-754 counterpart when the same number of bits is allocated to the dominant value in the array. It is estimated that at least 10% bit savings could be achieved by the new complex block representation relative to the quad-precision IEEE-754 format. Further bit savings of up to 18% could potentially be achieved for complex blocks relative to half-precision and single-precision IEEE-754 representations. The implementation cost of the proposed block floating-point format is evaluated in terms of memory usage, design of arithmetic units, and memory input/output rates for communication system models and block diagrams. Further analysis is performed on the limitations and quantization effects of this complex block format relative to the complex IEEE-754 format. The arithmetic unit designs cover a complex block adder and a complex block multiplier. The systems required to perform algorithms such as the fast Fourier transform (forward and inverse) are designed using the proposed complex block format in multi-stage complex block multiply-adders.

The proposed block floating-point format is simulated as a new numeric class defined and implemented in the MATLAB simulation environment. The MATLAB simulation is divided into two major parts. The first part targets the simulation of complex block addition and complex block multiplication units for an arbitrary number of complex samples per input block. The reference output values for the complex block arithmetic are those computed at similar precision in IEEE-754 format. The second part is performed on system models of single-carrier and multi-carrier modulation-based communication systems. Quadrature amplitude modulation (QAM) is the baseband modulation type targeted in this work. The specifications identified in the system model follow those given in the Long-Term Evolution (LTE) Standards for Base Station, Release 12.

Electrical and Computer Engineering
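As a companion to the encoder sketched under the earlier conference version of this work, here is a hedged sketch of the complex block multiplier this abstract mentions, using the same value ~= mantissa * 2**exp convention with one exponent per block. Multiplying two blocks multiplies mantissas element-wise and adds the two shared exponents, then renormalizes to a new shared exponent; the renormalization and rounding policy here is an assumption, not the thesis design.

```python
import numpy as np

def block_multiply(re_a, im_a, exp_a, re_b, im_b, exp_b, mantissa_bits=8):
    """Element-wise complex product of two shared-exponent blocks."""
    prod_re = re_a * re_b - im_a * im_b     # integer cross products
    prod_im = re_a * im_b + im_a * re_b
    exp = exp_a + exp_b                     # exponents add on multiply
    # Renormalize so mantissas fit the target width again, bumping
    # the shared exponent by the shift amount.
    peak = max(np.abs(prod_re).max(), np.abs(prod_im).max(), 1)
    shift = max(int(np.floor(np.log2(peak))) + 1 - (mantissa_bits - 1), 0)
    return prod_re >> shift, prod_im >> shift, exp + shift
```

A complex block adder is similar, except that the mantissas of the block with the smaller exponent must first be shifted right to align with the larger exponent before the element-wise addition.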
XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes
Heavily quantized fixed-point arithmetic is becoming a common approach to deploy Convolutional Neural Networks (CNNs) on limited-memory low-power IoT end-nodes. However, this trend is hampered by the lack of support for low-bitwidth arithmetic in the arithmetic units of state-of-the-art embedded microcontrollers (MCUs). This work proposes a multi-precision arithmetic unit fully integrated into a RISC-V processor at the micro-architectural and ISA level to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we show near-linear speedup with respect to higher-precision integer computation on the key kernels for QNN computation. We also propose a custom execution paradigm for SIMD sum-of-dot-product operations, which fuses a dot product with a load operation, yielding up to a 1.64× peak MAC/cycle improvement compared to a standard execution scenario. To push efficiency further, we integrate the RISC-V extended core into a parallel cluster of 8 processors, with near-linear improvement with respect to a single-core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extension run 6× and 8× faster when considering 4- and 2-bit data operands, respectively, compared to a baseline processing cluster supporting only 8-bit SIMD instructions. With a peak of 2.22 TOPS/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU.
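The nibble SIMD extension packs eight 4-bit operands into each 32-bit register and reduces them with one fused sum-of-dot-product instruction. The Python emulation below illustrates the semantics of such an instruction on packed signed nibbles; the packing order and function names are assumptions for illustration, not the XpulpNN encoding.

```python
def pack_nibbles(values):
    """Pack eight signed 4-bit values (-8..7) into one 32-bit word,
    lowest lane first."""
    word = 0
    for i, v in enumerate(values):
        assert -8 <= v <= 7 and i < 8
        word |= (v & 0xF) << (4 * i)
    return word

def sdotp_nibble(word_a, word_b, acc=0):
    """Emulate a SIMD sum-of-dot-product over eight 4-bit lanes:
    sign-extend each lane, multiply lane-wise, and accumulate into a
    wide register -- what the extended ISA does in one instruction."""
    for i in range(8):
        a = (word_a >> (4 * i)) & 0xF
        b = (word_b >> (4 * i)) & 0xF
        a -= 16 if a >= 8 else 0   # sign-extend 4-bit lane
        b -= 16 if b >= 8 else 0
        acc += a * b
    return acc

# Example: dot([1,-2,3,-4,5,-6,7,-8], [1,...,1]) == -4
acc = sdotp_nibble(pack_nibbles([1, -2, 3, -4, 5, -6, 7, -8]),
                   pack_nibbles([1] * 8))
```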