One-Bit Algorithm Considerations for Sparse PMCW Radar
Phase Modulated Continuous Wave (PMCW) radar is an emerging technology for autonomous cars. It is more flexible than current frequency-modulated systems, offering better detection resolution, interference mitigation, and future development opportunities. The issue preventing PMCW adoption is the need for high-sample-rate analog-to-digital converters (ADCs). Due to device limits, every added bit of resolution at a given sampling rate incurs a large increase in cost and power consumption. This thesis explores radar detection techniques for few-bit and 1-bit ADC measurements. 1-bit quantization typically results in poor amplitude estimation, which can limit detections if the target signals are weak. Time-Varying Thresholds (TVTs) in the quantizer are one way to preserve that amplitude information.
An existing few-bit Fast Iterative Shrinkage Thresholding Algorithm (FISTA) was adapted to use 1-bit TVT quantization. Three test scenarios compared the original FISTA using 1- and 2-bit quantization against the TVT approach: widely spaced targets, adjacent targets, and high-dynamic-range targets. Performance metrics included the normalized mean squared error (NMSE) of target amplitude estimates and receiver operating characteristic (ROC) curves for detection accuracy. Results showed the TVT implementation operated over the widest range of SNR values, had the lowest amplitude-estimate NMSE at high SNR, and achieved NMSE comparable to 2-bit FISTA at low SNR. It also reduced NMSE compared to 1-bit FISTA without TVTs. Few-bit FISTA had the best detection rates at specific SNR values but was more sensitive to noise. Averaged across the full SNR range, the AUC values of TVT FISTA were the most robust, measuring higher than those of both 1-bit and 2-bit FISTA.
Advisor: Andrew Harm
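A minimal sketch of the time-varying-threshold idea (an illustration under assumed values, not the thesis code): with a fixed zero threshold, a 1-bit quantizer saturates and the target amplitude is lost, while dithering the threshold over the expected dynamic range lets the mean of the sign outputs recover it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
amplitude = 0.6                                # unknown target amplitude (assumed)
x = amplitude + 0.1 * rng.standard_normal(n)   # noisy measurements

# Fixed zero threshold: nearly every sample clips to +1, amplitude is lost.
print("fixed threshold estimate:", np.sign(x).mean())   # ~1.0

# Time-varying thresholds drawn uniformly over an assumed dynamic range T = 1.
tau = rng.uniform(-1.0, 1.0, n)
y = np.sign(x - tau)                           # one bit per sample

# For tau ~ U(-T, T) and |x| < T, E[sign(x - tau)] = x / T, so the scaled
# mean of the 1-bit stream estimates the amplitude.
print("TVT estimate:", y.mean() * 1.0)         # ~0.6
```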
Machine Learning for Microcontroller-Class Hardware -- A Review
The advancements in machine learning have opened a new opportunity to bring intelligence to low-end Internet-of-Things nodes such as microcontrollers. Conventional machine learning models have a high memory and compute footprint, hindering their direct deployment on ultra-resource-constrained microcontrollers. This paper highlights the unique requirements of enabling onboard machine learning for microcontroller-class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure that the compute and latency budget is within the device limits while still maintaining the desired performance. We characterize a closed-loop, widely applicable workflow of machine learning model development for microcontroller-class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into different stages of model development by showcasing several use cases. Finally, we identify the open research challenges and unsolved questions demanding careful consideration moving forward.
Comment: Accepted for publication at IEEE Sensors Journal
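As a concrete instance of such a workflow, the sketch below follows the common TensorFlow Lite full-integer quantization path used for microcontroller deployment; the toy model, random calibration data, and output file name are placeholders, not artifacts of the paper.

```python
import numpy as np
import tensorflow as tf

# Placeholder model standing in for a trained network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4),
])

def representative_data():
    # Calibration samples drive the choice of int8 scales; random data here.
    for _ in range(100):
        yield [np.random.rand(1, 32).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # full-integer I/O suits MCU runtimes
converter.inference_output_type = tf.int8
open("model.tflite", "wb").write(converter.convert())
```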
Time and frequency domain algorithms for speech coding
The promise of digital hardware economies (due to recent advances in VLSI technology) has focussed much attention on more complex and sophisticated speech coding algorithms which offer improved quality at relatively low bit rates.
This thesis describes the results (obtained from computer simulations)
of research into various efficient (time and frequency domain) speech
encoders operating at a transmission bit rate of 16 Kbps.
In the time domain, Adaptive Differential Pulse Code Modulation (ADPCM)
systems employing both forward and backward adaptive prediction were
examined. A number of algorithms were proposed and evaluated, including
several variants of the Stochastic Approximation Predictor (SAP). A
Backward Block Adaptive (BBA) predictor was also developed and found to
outperform the conventional stochastic methods, even though its complexity
in terms of signal processing requirements is lower. A simplified Adaptive Predictive Coder (APC) employing a single-tap pitch predictor, considered next, provided a slight improvement in performance over ADPCM, but with rather greater complexity.
The ultimate test of any speech coding system is the perceptual performance
of the received speech. Recent research has indicated that this
may be enhanced by suitable control of the noise spectrum according to
the theory of auditory masking. Various noise shaping ADPCM
configurations were examined, and it was demonstrated that a proposed
pre-/post-filtering arrangement which exploits advantageously the
predictor-quantizer interaction, leads to the best subjective
performance in both forward and backward prediction systems.
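For illustration, a first-order error-feedback quantizer (a generic noise-shaping structure, not the thesis's specific pre-/post-filter arrangement) shows the mechanism: feeding the filtered quantization error back before quantizing multiplies the output noise spectrum by the feedback filter response.

```python
import numpy as np

def quantize(v, step=0.25):
    return step * np.round(v / step)

def noise_shaped_quantize(x, a=0.9):
    """Error feedback: output noise becomes (1 - a*z^-1) times the raw noise."""
    y = np.empty_like(x)
    e = 0.0                                   # previous quantizer error
    for n, s in enumerate(x):
        v = s - a * e                         # subtract filtered past error
        y[n] = quantize(v)
        e = y[n] - v                          # instantaneous quantizer error
    return y                                  # noise pushed toward high frequencies

y = noise_shaped_quantize(np.sin(2 * np.pi * 0.01 * np.arange(512)))
```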
Adaptive quantization is instrumental to the performance of ADPCM systems.
Both the forward adaptive quantizer (AQF) and the backward one-word memory adaptation (AQJ) were examined. In addition, a novel method of decreasing quantization noise in ADPCM-AQJ coders, which involves the application of a correction to the decoded speech samples, provided
reduced output noise across the spectrum, with considerable high frequency
noise suppression.
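The one-word-memory principle can be sketched as follows (Jayant-style step multipliers; the values are illustrative, not those of the thesis). Because the step size is updated from the previous transmitted code alone, the decoder tracks it exactly with no side information.

```python
def aqj_encode(samples, multipliers=(0.9, 0.9, 1.25, 1.75), delta0=0.1):
    """2-bit magnitude quantizer with backward (one-word memory) step adaptation."""
    delta, codes = delta0, []
    for s in samples:
        level = int(min(abs(s) // delta, 3))         # one of 4 magnitude bins
        codes.append((level, 1 if s >= 0 else -1))   # the transmitted code word
        # expand the step after outer levels, shrink it after inner ones
        delta = min(max(delta * multipliers[level], 1e-4), 10.0)
    return codes
```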
More powerful (and inevitably more complex) frequency domain speech
coders such as the Adaptive Transform Coder (ATC) and the Sub-band Coder
(SBC) offer good quality speech at 16 Kbps. To reduce complexity and
coding delay, whilst retaining the advantage of sub-band coding, a novel
transform-based split-band coder (TSBC) was developed and found to compare
closely in performance with the SBC.
To prevent the heavy side information requirement associated with a
large number of bands in split-band coding schemes from impairing coding
accuracy, without forgoing the efficiency provided by adaptive bit
allocation, a method employing AQJs to code the sub-band signals together
with vector quantization of the bit allocation patterns was also
proposed.
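The classical variance-based rule sketched below is the usual starting point for such adaptive allocation (a generic illustration; the thesis's exact scheme and its vector-quantized allocation patterns are not reproduced): higher-energy bands receive proportionally more bits.

```python
import numpy as np

def allocate_bits(band_vars, total_bits):
    """b_k = B/N + 0.5*log2(var_k / geometric mean), clipped to non-negative ints."""
    v = np.asarray(band_vars, dtype=float)
    geo_mean = np.exp(np.log(v).mean())
    raw = total_bits / len(v) + 0.5 * np.log2(v / geo_mean)
    return np.maximum(np.round(raw), 0).astype(int)

print(allocate_bits([4.0, 1.0, 0.25, 0.0625], total_bits=16))   # -> [6 4 4 2]
```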
Finally, 'pipeline' methods of bit allocation and step size estimation
(using the Fast Fourier Transform (FFT) on the input signal) were examined.
Such methods, although less accurate, are nevertheless useful in
limiting coding delay associated with SBC schemes employing Quadrature Mirror Filters (QMF).
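To make the QMF mechanism concrete, here is a two-band split using the Haar pair, the shortest QMF prototype (illustration only; practical coders use longer filters to get sharper band edges):

```python
import numpy as np

def analysis(x):
    """Split into half-rate low and high bands with the Haar QMF pair."""
    pairs = x[: len(x) // 2 * 2].reshape(-1, 2)
    low = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)    # lowpass, decimated by 2
    high = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # highpass, decimated by 2
    return low, high

def synthesis(low, high):
    """Recombine the sub-band signals; for this pair, reconstruction is exact."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x

x = np.random.default_rng(1).standard_normal(16)
low, high = analysis(x)
assert np.allclose(synthesis(low, high), x)           # perfect reconstruction
```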
Model Compression and Efficient Inference for Large Language Models: A Survey
Transformer-based large language models have achieved tremendous success. However, the significant memory and computational costs incurred during the inference process make it challenging to deploy large models on resource-constrained devices. In this paper, we investigate compression and efficient inference methods for large language models from an algorithmic perspective. Regarding taxonomy, as with smaller models, compression and acceleration algorithms for large language models can still be categorized into quantization, pruning, distillation, compact architecture design, and dynamic networks. However, large language models have two prominent characteristics compared to smaller models: (1) Most compression algorithms require fine-tuning or even retraining the model after compression, and the most notable aspect of large models is the very high cost associated with such fine-tuning or training. Therefore, many algorithms for large models, such as quantization and pruning, have started to explore tuning-free approaches. (2) Large models emphasize versatility and generalization rather than performance on a single task. Hence, many algorithms, such as knowledge distillation, focus on preserving versatility and generalization after compression. Since these two characteristics were not very pronounced in early large models, we further distinguish large language models into medium models and "real" large models. Additionally, we provide an introduction to some mature frameworks for efficient inference of large models, which can support basic compression or acceleration algorithms, greatly facilitating model deployment for users.
Comment: 47 pages, reviews 380 papers. The work is ongoing.
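As a concrete example of the tuning-free direction, the sketch below implements plain round-to-nearest, per-channel weight quantization, the baseline most post-training schemes build on (a generic illustration, not a specific algorithm from the survey):

```python
import numpy as np

def quantize_per_channel(w, n_bits=8):
    """Symmetric round-to-nearest quantization, one scale per output channel."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # per-row scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
q, s = quantize_per_channel(w)
print("max abs error:", np.abs(dequantize(q, s) - w).max())
```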
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights remains a pressing concern. Even though quantization techniques are widely proposed to ease memory demands and accelerate LLM inference, most of these techniques are geared towards the deployment phase. To bridge this gap, this paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs. By updating solely the quantization scales, PEQA can be directly applied to quantized LLMs, ensuring seamless task transitions. Like existing PEFT methods, PEQA significantly reduces the memory overhead associated with the optimizer state. Furthermore, it leverages the advantages of quantization to substantially reduce model sizes. Even after fine-tuning, the quantization structure of a PEQA-tuned LLM remains intact, allowing for accelerated inference at the deployment stage. We employ PEQA-tuning for task-specific adaptation on LLMs with up to 65 billion parameters. To assess the logical reasoning and language comprehension of PEQA-tuned LLMs, we fine-tune low-bit quantized LLMs using an instruction dataset. Our results show that even when LLMs are quantized to below 4-bit precision, their capabilities in language modeling, few-shot in-context learning, and comprehension can be resiliently restored to (or even improved over) their original full-precision performance with PEQA.
Comment: Published at NeurIPS 2023. Camera-ready version.
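A minimal sketch of the idea as described in the abstract (a rendering of the concept, not the authors' released code; the class name and 4-bit setting are assumptions): the integer weights stay frozen as buffers and only the per-channel scales receive gradients.

```python
import torch
import torch.nn as nn

class PEQALinear(nn.Module):
    """Hypothetical scale-only-trainable quantized linear layer."""
    def __init__(self, weight: torch.Tensor, n_bits: int = 4):
        super().__init__()
        qmax = 2 ** (n_bits - 1) - 1
        scale = weight.abs().amax(dim=1, keepdim=True) / qmax
        q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
        self.register_buffer("q", q)        # frozen low-bit integer weights
        self.scale = nn.Parameter(scale)    # the only trainable tensor

    def forward(self, x):
        return x @ (self.q * self.scale).T  # dequantize on the fly

layer = PEQALinear(torch.randn(8, 32))
print(sum(p.numel() for p in layer.parameters()))   # 8 scales, not 8*32 weights
```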
Cross-Layer Automated Hardware Design for Accuracy-Configurable Approximate Computing
Approximate Computing trades off computation accuracy against performance or energy efficiency. It is a design paradigm that arose in the last decade as an answer to diminishing returns from Dennard's scaling and a shift in the prominent workloads. A range of modern workloads, categorized mainly as recognition, mining, and synthesis, features an inherent tolerance to approximations. Their characteristics, such as redundancies in their input data and robust-to-noise algorithms, allow them to produce outputs of acceptable quality despite an approximation in some of their computations. Approximate Computing leverages this application tolerance by relaxing the exactness of computation towards the primary design goals of increasing performance or improving energy efficiency. Existing techniques span the abstraction layers of computer systems, where cross-layer techniques are shown to offer a larger design space and yield higher savings. Currently, the majority of the existing work aims at meeting a single accuracy target. The extent of approximation tolerance, however, varies significantly with changes in input characteristics and applications.
In this dissertation, methods and implementations are presented for cross-layer and automated design of accuracy-configurable Approximate Computing to maximally exploit the performance and energy benefits. In particular, this dissertation addresses the following challenges and introduces novel contributions:
A main Approximate Computing category in hardware is to scale either voltage or frequency beyond the safe limits for power or performance benefits, respectively. The rationale is that timing errors would be gradual and, over an initial range, tolerable. This scaling enables fine-grained accuracy-configurability by varying the occurrence of timing errors. However, conventional synthesis tools aim at meeting a single delay target for all paths within the circuit. Consequently, under voltage or frequency scaling, either all paths succeed or a large number of paths fail simultaneously, with a steep increase in error rate and magnitude. This dissertation presents an automated method for minimizing path delays by individually constraining the primary outputs of combinational circuits. As a result, it reduces the number of failing paths and makes timing errors significantly more gradual, as well as rarer and smaller on average. Additionally, it reveals that delays can be significantly reduced towards the least significant bit (LSB), allowing operation at a higher frequency when small operands are computed.
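A toy software model of the effect (a simplification with an assumed unit delay per carry stage, not the dissertation's synthesis flow): in a ripple-carry adder, the worst-case path to output bit i spans i+1 stages, so per-output delay constraints keep the LSBs valid as the clock tightens.

```python
UNIT_DELAY = 1.0                              # assumed delay of one carry stage

def settled_bits(clock_period, n_bits=8):
    """Output bits whose carry chain settles within the clock period."""
    return [i for i in range(n_bits) if (i + 1) * UNIT_DELAY <= clock_period]

for period in (8.0, 5.0, 3.0):
    print(period, settled_bits(period))       # tighter clocks lose the MSBs first
```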
Precision scaling, i.e., reducing the representation of data and its accuracy, is widely used across multiple abstraction layers in Approximate Computing. Reducing data precision also reduces transistor toggling, and therefore the dynamic power consumption. Application- and architecture-level precision scaling results in using only the LSBs of the circuit. Arithmetic circuits often have less complexity and logic depth in the LSBs compared to the most significant bits (MSBs). To take advantage of this circuit property, a delay-altering synthesis methodology is proposed. The method finds energy-optimal delay values under configurable precision usage and assigns them to the primary outputs used for different precisions. Thereby, it enables dynamic frequency-precision-scalable circuits for energy efficiency.
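A small illustration of the power argument, with toggle count as an assumed proxy for dynamic power: narrow operands confined to the LSBs of a wide datapath flip fewer bits between successive words.

```python
from random import Random

def toggles(words):
    """Total bit flips between successive data words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

rng = Random(0)
data = [rng.getrandbits(16) for _ in range(10_000)]
for width in (16, 12, 8):
    narrowed = [d >> (16 - width) for d in data]   # operand occupies `width` LSBs
    print(width, toggles(narrowed))                # toggling shrinks with precision
```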
Within the hardware architecture, it is possible to instantiate multiple units with the same functionality but different fixed approximation levels, where each block benefits from having fewer transistors as well as synthesis relaxations. These blocks can be selected dynamically and thus allow the accuracy to be configured at runtime. Instantiating such approximate blocks can be a lower-dynamic-power but higher-area-and-leakage-cost alternative to the current state-of-the-art gating mechanisms, which switch off a group of paths in the circuit to reduce toggling activity. Jointly, instantiating multiple blocks and gating mechanisms produce a large design space of accuracy-configurable hardware, where finding energy-optimal solutions requires a cross-layer search across the architecture and circuit levels. To that end, an approximate hardware synthesis methodology is proposed with joint optimizations at the architecture and circuit levels for dynamic accuracy scaling, thereby enabling energy vs. area trade-offs.
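A toy sweep over that design space (all numbers are hypothetical, purely to show the trade-off structure): for each accuracy mode, pick the cheapest option among the exact block, gating within it, or dispatching to a pre-instantiated approximate block.

```python
configs = [
    # (name, dynamic power, area, worst-case error) -- hypothetical numbers
    ("exact",         1.00, 1.0, 0.00),
    ("exact+gating",  0.80, 1.0, 0.05),
    ("approx block",  0.55, 1.6, 0.05),   # fewer transistors, relaxed synthesis
    ("approx+gating", 0.45, 1.6, 0.10),
]

def best(error_budget, area_weight=0.1):
    """Cheapest configuration meeting the mode's error budget (crude cost model)."""
    feasible = [c for c in configs if c[3] <= error_budget]
    return min(feasible, key=lambda c: c[1] + area_weight * c[2])

for budget in (0.0, 0.05, 0.10):
    print(budget, best(budget)[0])
```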