2,605 research outputs found
Efficient DSP and Circuit Architectures for Massive MIMO: State-of-the-Art and Future Directions
Massive MIMO is a compelling wireless access concept that relies on the use
of an excess number of base-station antennas, relative to the number of active
terminals. This technology is a main component of 5G New Radio (NR) and
addresses all important requirements of future wireless standards: a great
capacity increase, the support of many simultaneous users, and improvement in
energy efficiency. Massive MIMO requires the simultaneous processing of signals
from many antenna chains, and computational operations on large matrices. The
complexity of the digital processing has been viewed as a fundamental obstacle
to the feasibility of Massive MIMO in the past. Recent advances on
system-algorithm-hardware co-design have led to extremely energy-efficient
implementations. These exploit opportunities in deeply-scaled silicon
technologies and perform partly distributed processing to cope with the
bottlenecks encountered in the interconnection of many signals. For example,
prototype ASIC implementations have demonstrated zero-forcing precoding in real
time at a 55 mW power consumption (20 MHz bandwidth, 128 antennas, multiplexing
of 8 terminals). Coarse and even error-prone digital processing in the antenna
paths permits a reduction of consumption with a factor of 2 to 5. This article
summarizes the fundamental technical contributions to efficient digital signal
processing for Massive MIMO. The opportunities and constraints on operating on
low-complexity RF and analog hardware chains are clarified. It illustrates how
terminals can benefit from improved energy efficiency. The status of technology
and real-life prototypes discussed. Open challenges and directions for future
research are suggested.Comment: submitted to IEEE transactions on signal processin
Recommended from our members
Versatile stochastic dot product circuits based on nonvolatile memories for high performance neurocomputing and neurooptimization.
The key operation in stochastic neural networks, which have become the state-of-the-art approach for solving problems in machine learning, information theory, and statistics, is a stochastic dot-product. While there have been many demonstrations of dot-product circuits and, separately, of stochastic neurons, the efficient hardware implementation combining both functionalities is still missing. Here we report compact, fast, energy-efficient, and scalable stochastic dot-product circuits based on either passively integrated metal-oxide memristors or embedded floating-gate memories. The circuit's high performance is due to mixed-signal implementation, while the efficient stochastic operation is achieved by utilizing circuit's noise, intrinsic and/or extrinsic to the memory cell array. The dynamic scaling of weights, enabled by analog memory devices, allows for efficient realization of different annealing approaches to improve functionality. The proposed approach is experimentally verified for two representative applications, namely by implementing neural network for solving a four-node graph-partitioning problem, and a Boltzmann machine with 10-input and 8-hidden neurons
Analog MAP decoder for (8, 4) hamming code in subthreshold CMOS
Journal ArticleAn all-MOS analog implementation of a MAP decoder is presented for the (8, 4) extended Hamming code. This paper describes the design and analysis of a tail-biting trellis decoder implementation using subthreshold CMOS devices. A VLSI test chip has recently returned from fabrication, and preliminary test results indicate accurate decoding up to 20 MBit/s
CMOS analog map decoder for (8,4) hamming code
Journal ArticleAbstract-Design and test results for a fully integrated translinear tail-biting MAP error-control decoder are presented. Decoder designs have been reported for various applications which make use of analog computation, mostly for Viterbi-style decoders. MAP decoders are more complex, and are necessary components of powerful iterative decoding systems such as Turbo codes. Analog circuits may require less area and power than digital implementations in high-speed iterative applications. Our (8, 4) Hamming decoder, implemented in an AMI 0.5- m process, is the first functioning CMOS analog MAP decoder. While designed to operate in subthreshold, the decoder also functions above threshold with a small performance penalty. The chip has been tested at bit rates up to 2 Mb/s, and simulations indicate a top speed of about 10 Mb/s in strong inversion. The decoder circuit size is 0.82 mm2, and typical power consumption is 1 mW at 1 Mb/s
Large-Scale Optical Neural Networks based on Photoelectric Multiplication
Recent success in deep neural networks has generated strong interest in
hardware accelerators to improve speed and energy consumption. This paper
presents a new type of photonic accelerator based on coherent detection that is
scalable to large () networks and can be operated at high (GHz)
speeds and very low (sub-aJ) energies per multiply-and-accumulate (MAC), using
the massive spatial multiplexing enabled by standard free-space optical
components. In contrast to previous approaches, both weights and inputs are
optically encoded so that the network can be reprogrammed and trained on the
fly. Simulations of the network using models for digit- and
image-classification reveal a "standard quantum limit" for optical neural
networks, set by photodetector shot noise. This bound, which can be as low as
50 zJ/MAC, suggests performance below the thermodynamic (Landauer) limit for
digital irreversible computation is theoretically possible in this device. The
proposed accelerator can implement both fully-connected and convolutional
networks. We also present a scheme for back-propagation and training that can
be performed in the same hardware. This architecture will enable a new class of
ultra-low-energy processors for deep learning.Comment: Text: 10 pages, 5 figures, 1 table. Supplementary: 8 pages, 5,
figures, 2 table
Low Power Decoding Circuits for Ultra Portable Devices
A wide spread of existing and emerging battery driven wireless devices do not necessarily demand high data rates. Rather, ultra low power, portability and low cost are the most desired characteristics. Examples of such applications are wireless sensor networks (WSN), body area networks (BAN), and a variety of medical implants and health-care aids. Being small, cheap and low power for the individual transceiver nodes, let those to be used in abundance in remote places, where access for maintenance or recharging the battery is limited. In such scenarios, the lifetime of the battery, in most cases, determines the lifetime of the individual nodes. Therefore, energy consumption has to be so low that the nodes remain operational for an extended period of time, even up to a few years. It is known that using error correcting codes (ECC) in a wireless link can potentially help to reduce the transmit power considerably. However, the power consumption of the coding-decoding hardware itself is critical in an ultra low power transceiver node. Power and silicon area overhead of coding-decoding circuitry needs to be kept at a minimum in the total energy and cost budget of the transceiver node. In this thesis, low power approaches in decoding circuits in the framework of the mentioned applications and use cases are investigated. The presented work is based on the 65nm CMOS technology and is structured in four parts as follows: In the first part, goals and objectives, background theory and fundamentals of the presented work is introduced. Also, the ECC block in coordination with its surrounding environment, a low power receiver chain, is presented. Designing and implementing an ultra low power and low cost wireless transceiver node introduces challenges that requires special considerations at various levels of abstraction. Similarly, a competitive solution often occurs after a conclusive design space exploration. The proposed decoder circuits in the following parts are designed to be embedded in the low power receiver chain, that is introduced in the first part. Second part, explores analog decoding method and its capabilities to be embedded in a compact and low power transceiver node. Analog decod- ing method has been theoretically introduced over a decade ago that followed with early proof of concept circuits that promised it to be a feasible low power solution. Still, with the increased popularity of low power sensor networks, it has not been clear how an analog decoding approach performs in terms of power, silicon area, data rate and integrity of calculations in recent technologies and for low data rates. Ultra low power budget, small size requirement and more relaxed demands on data rates suggests a decoding circuit with limited complexity. Therefore, the four-state (7,5) codes are considered for hardware implementation. Simulations to chose the critical design factors are presented. Consequently, to evaluate critical specifications of the decoding circuit, three versions of analog decoding circuit with different transistor dimensions fabricated. The measurements results reveal different trade-off possibilities as well as the potentials and limitations of the analog decoding approach for the target applications. Measurements seem to be crucial, since the available computer-aided design (CAD) tools provide limited assistance and precision, given the amount of calculations and parameters that has to be included in the simulations. The largest analog decoding core (AD1) takes 0.104mm2 on silicon and the other two (AD2 and AD3) take 0.035mm2 and 0.015mm2, respectively. Consequently, coding gain in trade-off with silicon area and throughput is presented. The analog decoders operate with 0.8V supply. The achieved coding gain is 2.3 dB at bit error rates (BER)=0.001 and 10 pico-Joules per bit (pJ/b) energy efficiency is reached at 2 Mbps. Third part of this thesis, proposes an alternative low power digital decoding approach for the same codes. The desired compact and low power goal has been pursued by designing an equivalent digital decoding circuit that is fabricated in 65nm CMOS technology and operates in low voltage (near-threshold) region. The architecture of the design is optimized in system and circuit levels to propose a competitive digital alternative. Similarly, critical specifications of the decoder in terms of power, area, data rate (speed) and integrity are reported according to the measurements. The digital implementation with 0.11mm2 area, consumes minimum energy at 0.32V supply which gives 9 pJ/b energy efficiency at 125 kb/s and 2.9 dB coding gain at BER=0.001. The forth and last part, compares the proposed design alternatives based on the fabricated chips and the results attained from the measurements to conclude the most suitable solution for the considered target applications. Advantages and disadvantages of both approaches are discussed. Possible extensions of this work is introduced as future work
Mixed-precision deep learning based on computational memory
Deep neural networks (DNNs) have revolutionized the field of artificial
intelligence and have achieved unprecedented success in cognitive tasks such as
image and speech recognition. Training of large DNNs, however, is
computationally intensive and this has motivated the search for novel computing
architectures targeting this application. A computational memory unit with
nanoscale resistive memory devices organized in crossbar arrays could store the
synaptic weights in their conductance states and perform the expensive weighted
summations in place in a non-von Neumann manner. However, updating the
conductance states in a reliable manner during the weight update process is a
fundamental challenge that limits the training accuracy of such an
implementation. Here, we propose a mixed-precision architecture that combines a
computational memory unit performing the weighted summations and imprecise
conductance updates with a digital processing unit that accumulates the weight
updates in high precision. A combined hardware/software training experiment of
a multilayer perceptron based on the proposed architecture using a phase-change
memory (PCM) array achieves 97.73% test accuracy on the task of classifying
handwritten digits (based on the MNIST dataset), within 0.6% of the software
baseline. The architecture is further evaluated using accurate behavioral
models of PCM on a wide class of networks, namely convolutional neural
networks, long-short-term-memory networks, and generative-adversarial networks.
Accuracies comparable to those of floating-point implementations are achieved
without being constrained by the non-idealities associated with the PCM
devices. A system-level study demonstrates 173x improvement in energy
efficiency of the architecture when used for training a multilayer perceptron
compared with a dedicated fully digital 32-bit implementation
- …