Bayes-Optimal Joint Channel-and-Data Estimation for Massive MIMO with Low-Precision ADCs
This paper considers a multiple-input multiple-output (MIMO) receiver with
very low-precision analog-to-digital converters (ADCs) with the goal of
developing massive MIMO antenna systems that require minimal cost and power.
Previous studies demonstrated that the training duration should be relatively
long to obtain acceptable channel state information. To address
this requirement, we adopt a joint channel-and-data (JCD) estimation method
based on Bayes-optimal inference. This method yields minimal mean square errors
with respect to the channels and payload data. We develop a Bayes-optimal JCD
estimator using a recent technique based on approximate message passing. We
then present an analytical framework to study the theoretical performance of
the estimator in the large-system limit. Simulation results confirm our
analytical results, which allow the efficient evaluation of the performance of
quantized massive MIMO systems and provide insights into effective system
design.
Comment: accepted in IEEE Transactions on Signal Processing
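The core nonlinear ingredient in such an AMP-based estimator is a scalar Bayes-optimal (MMSE) denoiser applied behind the quantizer. As a minimal sketch (an illustration of the 1-bit case only, not the paper's full JCD algorithm; the variances below are assumptions), the snippet computes the closed-form MMSE estimate of a Gaussian variable observed through a 1-bit quantizer and checks it by Monte Carlo simulation.

```python
import numpy as np
from math import sqrt, pi

rng = np.random.default_rng(0)
sigma_z, sigma_w = 1.0, 0.5            # assumed signal / noise std deviations
sigma_v = sqrt(sigma_z**2 + sigma_w**2)

# Model: y = sign(z + w), z ~ N(0, sigma_z^2), w ~ N(0, sigma_w^2).
# Closed form: E[z | y] = y * (sigma_z^2 / sigma_v) * sqrt(2/pi).
def mmse_denoise(y):
    return y * (sigma_z**2 / sigma_v) * sqrt(2.0 / pi)

# Monte Carlo check of the conditional mean.
z = sigma_z * rng.standard_normal(1_000_000)
w = sigma_w * rng.standard_normal(1_000_000)
y = np.sign(z + w)
print("analytical:", mmse_denoise(1.0))
print("empirical :", z[y > 0].mean())
```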
Modulation Diversity in Fading Channels with Quantized Receiver
In this paper, we address the design of codes that achieve modulation
diversity in block-fading single-input single-output (SISO) channels with
signal quantization at the receiver and low-complexity decoding. With an
unquantized receiver, coding based on algebraic rotations is known to achieve
modulation diversity. On the other hand, with a quantized receiver,
algebraic rotations may not guarantee diversity. Through analysis, we propose
specific rotations which result in the codewords having equidistant
component-wise projections. We show that the proposed coding scheme achieves
maximum modulation diversity with a low-complexity minimum distance decoder and
perfect channel knowledge. Relaxing the perfect channel knowledge assumption,
we propose a novel training/estimation and receiver control technique to
estimate the channel. We show that our coding/training/estimation scheme and
minimum distance decoding achieve an error probability performance similar to
that achieved with perfect channel knowledge.
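As a toy illustration of the underlying idea (not the paper's specific rotations or its quantized-receiver design; the rotation angle and channel model are assumptions), the sketch below compares minimum-distance decoding of unrotated and rotated two-component BPSK codewords over an independent block-fading channel. Rotation spreads each information symbol over both fading coefficients, which is what yields the diversity gain.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.5 * np.arctan(2.0)             # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# Two-component BPSK codebook (4 codewords).
S = np.array([[a, b] for a in (-1, 1) for b in (-1, 1)], float)

def error_rate(codebook, snr_db, trials=100_000):
    sigma = 10 ** (-snr_db / 20)
    idx = rng.integers(len(codebook), size=trials)
    # Independent Rayleigh fading per component (block fading).
    h = np.abs(rng.standard_normal((trials, 2)) +
               1j * rng.standard_normal((trials, 2))) / np.sqrt(2)
    y = h * codebook[idx] + sigma * rng.standard_normal((trials, 2))
    # Minimum-distance decoding with perfect channel knowledge.
    d = ((y[:, None, :] - h[:, None, :] * codebook[None, :, :]) ** 2).sum(axis=2)
    return np.mean(d.argmin(axis=1) != idx)

for snr in (10, 15, 20):
    print(snr, "dB  plain:", error_rate(S, snr),
          " rotated:", error_rate(S @ R.T, snr))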
Linear Precoding with Low-Resolution DACs for Massive MU-MIMO-OFDM Downlink
We consider the downlink of a massive multiuser (MU) multiple-input
multiple-output (MIMO) system in which the base station (BS) is equipped with
low-resolution digital-to-analog converters (DACs). In contrast to most
existing results, we assume that the system operates over a frequency-selective
wideband channel and uses orthogonal frequency division multiplexing (OFDM) to
simplify equalization at the user equipments (UEs). Furthermore, we consider
the practically relevant case of oversampling DACs. We theoretically analyze
the uncoded bit error rate (BER) performance with linear precoders (e.g., zero
forcing) and quadrature phase-shift keying using Bussgang's theorem. We also
develop a lower bound on the information-theoretic sum-rate throughput
achievable with Gaussian inputs, which can be evaluated in closed form for the
case of 1-bit DACs. For the case of multi-bit DACs, we derive approximate, yet
accurate, expressions for the distortion caused by low-precision DACs, which
can be used to establish lower bounds on the corresponding sum-rate throughput.
Our results demonstrate that, for a massive MU-MIMO-OFDM system with a
128-antenna BS serving 16 UEs, only 3–4 DAC bits are required to achieve an
uncoded BER of 10^-4 with a negligible performance loss compared to the
infinite-resolution case at the cost of additional out-of-band emissions.
Furthermore, our results highlight the importance of taking into account the
inherent spatial and temporal correlations caused by low-precision DACs.
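As a rough single-carrier sketch of this setup (a frequency-flat Rayleigh channel rather than the paper's wideband OFDM model; the array size, user loading, SNR, and quantizer parameters are assumptions), the snippet below measures the uncoded QPSK error rate of zero-forcing precoding followed by a B-bit uniform DAC.

```python
import numpy as np

rng = np.random.default_rng(2)
B_ANT, USERS, SNR_DB, SYMS = 128, 16, 10, 2000    # assumed system parameters

def uniform_dac(v, bits, clip=3.0):
    """Mid-rise uniform quantizer applied per real dimension."""
    levels = 2 ** bits
    step = 2 * clip / levels
    q = np.floor(v / step) + 0.5
    return np.clip(q, -levels / 2 + 0.5, levels / 2 - 0.5) * step

def symbol_error_rate(bits):
    H = (rng.standard_normal((USERS, B_ANT)) +
         1j * rng.standard_normal((USERS, B_ANT))) / np.sqrt(2)
    P = H.conj().T @ np.linalg.inv(H @ H.conj().T)            # zero-forcing precoder
    s = (rng.choice([-1, 1], (USERS, SYMS)) +
         1j * rng.choice([-1, 1], (USERS, SYMS))) / np.sqrt(2)  # QPSK symbols
    x = P @ s
    x /= np.sqrt(np.mean(np.abs(x) ** 2))                     # unit average power
    if bits is not None:                                      # None = ideal DAC
        scale = np.std(x.real)
        x = scale * (uniform_dac(x.real / scale, bits) +
                     1j * uniform_dac(x.imag / scale, bits))
    n0 = 10 ** (-SNR_DB / 10)
    y = H @ x + np.sqrt(n0 / 2) * (rng.standard_normal(s.shape) +
                                   1j * rng.standard_normal(s.shape))
    return np.mean((np.sign(y.real) != np.sign(s.real)) |
                   (np.sign(y.imag) != np.sign(s.imag)))

for b in (1, 2, 3, 4, None):
    print("DAC bits:", b, " symbol error rate:", symbol_error_rate(b))
```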
Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators
Analog in-memory computing (AIMC) -- a promising approach for
energy-efficient acceleration of deep learning workloads -- computes
matrix-vector multiplications (MVMs) but only approximately, due to
nonidealities that often are non-deterministic or nonlinear. This can adversely
impact the achievable deep neural network (DNN) inference accuracy as compared
to a conventional floating point (FP) implementation. While retraining has
previously been suggested to improve robustness, prior work has explored only a
few DNN topologies, using disparate and overly simplified AIMC hardware models.
Here, we use hardware-aware (HWA) training to systematically examine the
accuracy of AIMC for multiple common artificial intelligence (AI) workloads
across multiple DNN topologies, and investigate sensitivity and robustness to a
broad set of nonidealities. By introducing a new and highly realistic AIMC
crossbar model, we improve significantly on earlier retraining approaches. We
show that many large-scale DNNs of various topologies, including convolutional
neural networks (CNNs), recurrent neural networks (RNNs), and transformers, can
in fact be successfully retrained to show iso-accuracy on AIMC. Our results
further suggest that AIMC nonidealities that add noise to the inputs or
outputs, not the weights, have the largest impact on DNN accuracy, and that
RNNs are particularly robust to all nonidealities.
Comment: 35 pages, 7 figures, 5 tables
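A common ingredient of such hardware-aware retraining is injecting noise at the layer outputs during the forward pass so the network learns weights that tolerate it. Below is a minimal PyTorch sketch under a simple additive-Gaussian assumption (the paper's crossbar model is far richer; the noise scale is an assumed hyperparameter).

```python
import torch
from torch import nn

class NoisyLinear(nn.Linear):
    """Linear layer that perturbs its output with additive Gaussian noise
    during training, a crude stand-in for output-referred AIMC MVM
    nonidealities (assumed noise model)."""

    def __init__(self, in_features, out_features, out_noise=0.05):
        super().__init__(in_features, out_features)
        self.out_noise = out_noise

    def forward(self, x):
        y = super().forward(x)
        if self.training and self.out_noise > 0:
            # Scale the noise to the output statistics so the perturbation
            # is relative rather than absolute.
            y = y + self.out_noise * y.detach().std() * torch.randn_like(y)
        return y

# Drop-in replacement inside an otherwise standard model / training loop.
model = nn.Sequential(NoisyLinear(784, 256), nn.ReLU(), NoisyLinear(256, 10))
```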
Using the IBM Analog In-Memory Hardware Acceleration Kit for Neural Network Training and Inference
Analog In-Memory Computing (AIMC) is a promising approach to reduce the
latency and energy consumption of Deep Neural Network (DNN) inference and
training. However, the noisy and non-linear device characteristics, and the
non-ideal peripheral circuitry in AIMC chips, require adapting DNNs to be
deployed on such hardware to achieve equivalent accuracy to digital computing.
In this tutorial, we provide a deep dive into how such adaptations can be
achieved and evaluated using the recently released IBM Analog Hardware
Acceleration Kit (AIHWKit), freely available at https://github.com/IBM/aihwkit.
The AIHWKit is a Python library that simulates inference and training of DNNs
using AIMC. We present an in-depth description of the AIHWKit design,
functionality, and best practices to properly perform inference and training.
We also present an overview of the Analog AI Cloud Composer, which provides
the benefits of using the AIHWKit simulation platform in a fully managed cloud
setting. Finally, we show examples of how users can expand and customize
AIHWKit for their own needs. This tutorial is accompanied by comprehensive
Jupyter Notebook code examples that can be run using AIHWKit, which can be
downloaded from https://github.com/IBM/aihwkit/tree/master/notebooks/tutorial
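For reference, the basic usage pattern looks like the following (adapted from the AIHWKit README; the exact API may differ between releases): an analog fully connected layer is trained with the analog-aware SGD optimizer just like an ordinary PyTorch module.

```python
from torch import Tensor
from torch.nn.functional import mse_loss

from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD

x = Tensor([[0.1, 0.2, 0.4, 0.3], [0.2, 0.1, 0.1, 0.3]])
y = Tensor([[1.0, 0.5], [0.7, 0.3]])

# A fully connected layer whose weights live on a simulated analog tile.
model = AnalogLinear(4, 2)

opt = AnalogSGD(model.parameters(), lr=0.1)
opt.regroup_param_groups(model)

for epoch in range(10):
    opt.zero_grad()
    loss = mse_loss(model(x), y)
    loss.backward()
    opt.step()
```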
Development of a low-cost multi-camera star tracker for small satellites
This thesis presents a novel small satellite star tracker that uses an array of low-cost, off-the-shelf imaging sensors to achieve high-accuracy attitude determination performance. The theoretical analysis of improvements in star detectability achieved by stacking images from multiple cameras is presented. An image processing algorithm is developed to combine images from multiple cameras with arbitrary focal lengths, principal point offsets, distortions, and misalignments. The star tracker also implements other algorithms, including the region growing algorithm, the intensity weighted centroid algorithm, the geometric voting algorithm for star identification, and the singular value decomposition algorithm for attitude determination. A star tracker software simulator is used to test the algorithms by generating star images with sensor noise, lens defocusing, and lens distortion. A hardware prototype is being assembled for eventual night sky testing to verify simulated performance levels. Star tracker flight hardware is being developed in the Laboratory for Advanced Space Systems at Illinois (LASSI) at the University of Illinois at Urbana-Champaign for future CubeSat missions.
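Of the algorithms listed, the singular value decomposition (SVD) solution to the attitude determination step is compact enough to sketch. Given matched unit vectors in the body and inertial frames, the optimal rotation follows from the SVD of the weighted outer-product sum (the standard textbook construction for Wahba's problem; the weights and test rotation below are assumptions, not the thesis's flight code).

```python
import numpy as np

def svd_attitude(body, inertial, weights=None):
    """Solve Wahba's problem: find rotation A with body ~ A @ inertial."""
    if weights is None:
        weights = np.ones(len(body))
    B = sum(w * np.outer(b, r) for w, b, r in zip(weights, body, inertial))
    U, _, Vt = np.linalg.svd(B)
    # Force a proper rotation (determinant +1).
    d = np.linalg.det(U) * np.linalg.det(Vt)
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# Self-check with a known rotation and noiseless star vectors.
rng = np.random.default_rng(3)
true_A, _ = np.linalg.qr(rng.standard_normal((3, 3)))
true_A *= np.linalg.det(true_A)            # ensure det = +1
stars = rng.standard_normal((5, 3))
stars /= np.linalg.norm(stars, axis=1, keepdims=True)
est_A = svd_attitude(stars @ true_A.T, stars)
print("max attitude error:", np.abs(est_A - true_A).max())
```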
Finite precision deep learning with theoretical guarantees
Recent successes of deep learning have been achieved at the expense of a very high computational and parameter complexity. Today, deployment of both inference and training of deep neural networks (DNNs) is predominantly in the cloud. A recent alternative trend is to deploy DNNs onto untethered, resource-constrained platforms at the Edge. To realize on-device intelligence, the gap between algorithmic requirements and available resources needs to be closed. One popular way of doing so is via implementation in finite precision.
While ad hoc trial-and-error techniques in finite precision deep learning abound, theoretical guarantees on network accuracy are elusive. The work presented in this dissertation builds a theoretical framework for the implementation of deep learning in finite precision. For inference, we theoretically analyze the worst-case accuracy drop in the presence of weight and activation quantization. Furthermore, we derive an optimal clipping criterion (OCC) to minimize the precision of dot-product outputs. For implementations using in-memory computing, OCC lowers ADC precision requirements. We analyze fixed-point training and present a methodology for implementing quantized back-propagation with close-to-minimal per-tensor precision. Finally, we study accumulator precision for reduced precision floating-point training using variance analysis techniques.
We first introduce our work on fixed-point inference with accuracy guarantees. Theoretical bounds on the mismatch between limited and full precision networks are derived. Proper precision assignment can be readily obtained employing these bounds, and weight-activation, as well as per-layer precision trade-offs, are derived. Applied to a variety of networks and datasets, the presented analysis is found to be tight to within 2 bits. Furthermore, it is shown that a minimum precision network can have substantially lower hardware complexity than a binarized network at iso-accuracy. In general, a minimum precision network can substantially reduce complexity compared to a full precision baseline while maintaining accuracy. Per-layer precision analysis indicates that precision requirements of common networks vary from 2 bits to 10 bits to guarantee an accuracy close to the floating-point baseline.
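To make the setting concrete, the sketch below (an illustration, not the dissertation's bound derivation) applies per-tensor B-bit uniform quantization and verifies that the element-wise error never exceeds half a quantization step, the basic quantity such worst-case accuracy bounds are built from.

```python
import numpy as np

def quantize(t, bits):
    """Symmetric per-tensor uniform quantization to the given bit-width."""
    scale = np.abs(t).max() / (2 ** (bits - 1) - 1)
    return np.round(t / scale) * scale, scale

rng = np.random.default_rng(4)
w = rng.standard_normal(10_000)
for bits in (2, 4, 8):
    wq, scale = quantize(w, bits)
    # Round-to-nearest error is bounded by half a step.
    assert np.abs(w - wq).max() <= scale / 2 + 1e-12
    print(bits, "bits  max |error| / (step/2):",
          np.abs(w - wq).max() / (scale / 2))
```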
Then, we study DNN implementation using in-memory computing (IMC), where we propose OCC to minimize the column ADC precision. The signal-to-quantization-noise ratio (SQNR) of OCC is shown to be within 0.8 dB of the well-known optimal Lloyd-Max quantizer. OCC improves the SQNR of the commonly employed full range quantizer by 14 dB, which translates to a 3 bit ADC precision reduction. The input-serial weight-parallel (ISWP) IMC architecture is also studied. Using bit-slicing techniques, significant energy savings can be achieved with minimal accuracy loss. Indeed, we prove that a dot-product can be realized with a single memory access while suffering no more than a 2 dB SQNR drop. Combining the proposed OCC and ISWP noise analysis with our proposed DNN precision analysis, we demonstrate a reduction of energy consumption in DNN implementation at iso-accuracy.
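The trade-off behind such a clipping criterion can be reproduced numerically: for Gaussian data, sweeping the clipping level trades clipping distortion against granular quantization noise, and the SQNR-maximizing clip is far tighter than the full data range. The sketch below performs that sweep empirically (a numerical stand-in for the dissertation's analytical OCC; the bit-width is an assumption).

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(1_000_000)   # dot-product outputs modeled as Gaussian
bits = 4

def sqnr_db(x, clip, bits):
    step = 2 * clip / 2 ** bits
    q = np.clip(np.round(x / step) * step, -clip, clip)
    return 10 * np.log10(np.mean(x ** 2) / np.mean((x - q) ** 2))

clips = np.linspace(0.5, np.abs(x).max(), 200)
best = clips[np.argmax([sqnr_db(x, c, bits) for c in clips])]
print("best clip (std units):", best)
print("SQNR at best clip :", sqnr_db(x, best, bits), "dB")
print("SQNR at full range:", sqnr_db(x, np.abs(x).max(), bits), "dB")
```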
Furthermore, we study the quantization of the back-propagation training algorithm. We propose a systematic methodology to obtain close-to-minimal per-layer precision requirements for guaranteed statistical similarity between fixed-point and floating-point training. The challenges of quantization noise, inter-layer and intra-layer precision trade-offs, dynamic range, and stability are jointly addressed. Applied to several benchmarks, fixed-point training is demonstrated to achieve high fidelity to the baseline with an accuracy drop no greater than 0.56%. The derived precision assignment is shown to be within 1 bit per tensor of the minimum. The methodology is found to substantially reduce the representational, computational, and communication costs of training compared to the baseline and related works.
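One widely used ingredient in fixed-point training is stochastic rounding of gradients, which keeps the quantized updates unbiased. A minimal sketch follows (illustrative only; the per-tensor scaling and bit-width are assumptions, not the dissertation's full precision-assignment method).

```python
import numpy as np

rng = np.random.default_rng(6)

def stochastic_round_quantize(g, bits):
    """Per-tensor fixed-point quantization with unbiased stochastic rounding."""
    scale = np.abs(g).max() / (2 ** (bits - 1) - 1)
    v = g / scale
    floor = np.floor(v)
    # Round up with probability equal to the fractional part, so E[q] = v.
    v = floor + (rng.random(g.shape) < (v - floor))
    return v * scale

g = rng.standard_normal(100_000) * 1e-3       # mock gradient tensor
gq = stochastic_round_quantize(g, 8)
print("bias of quantized gradient:", (gq - g).mean())  # ~0 (unbiased)
```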
Finally, we address the problem of reduced precision floating-point training. In particular, we study accumulation precision requirements. We present the variance retention ratio (VRR), an analytical metric measuring the suitability of accumulation mantissa precision. The analysis expands on concepts employed in variance engineering for weight initialization. An analytical expression for the VRR is derived and used to determine accumulation bit-width for precise tailoring of computation hardware. The VRR also quantifies the benefits of effective summation reduction techniques such as chunked accumulation and sparsification. Experimentally, the validity and tightness of our analysis are verified across multiple deep learning benchmarks.
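The benefit of chunked accumulation that the VRR quantifies is easy to observe directly: accumulating many terms sequentially in a low-precision float swamps small addends once the running sum grows, while summing fixed-size chunks first keeps each partial sum small. A minimal numpy sketch (reduced-precision accumulation emulated with float16; the chunk size is an assumption), which typically shows a much smaller error for the chunked variant:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(2 ** 16).astype(np.float16)
ref = np.sum(x.astype(np.float64))            # high-precision reference

def seq_sum_fp16(values):
    """Sequential accumulation entirely in float16."""
    acc = np.float16(0)
    for v in values:
        acc = np.float16(acc + v)
    return acc

naive = seq_sum_fp16(x)                        # one long running sum
partials = [seq_sum_fp16(c) for c in x.reshape(256, 256)]
chunked = seq_sum_fp16(np.array(partials, dtype=np.float16))

print("naive   error:", abs(float(naive) - ref))
print("chunked error:", abs(float(chunked) - ref))
```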