PIPE : Parallelized Inference Through Post-Training Quantization Ensembling of Residual Expansions
Deep neural networks (DNNs) are ubiquitous in computer vision and natural
language processing, but suffer from high inference cost. This problem can be
addressed by quantization, which consists in converting floating point
operations into a lower bit-width format. With growing concerns about privacy
rights, we focus our efforts on data-free methods. However, such techniques
suffer from their lack of adaptability to the target devices, as hardware
typically supports only specific bit widths. Thus, to adapt to a variety of
devices, a quantization method shall be flexible enough to find good accuracy
vs. speed trade-offs for every bit width and target device. To achieve this,
we propose PIPE, a quantization method that leverages residual error expansion,
along with group sparsity and an ensemble approximation for better
parallelization. PIPE is backed by strong theoretical guarantees and
achieves superior performance on every benchmarked application (from vision to
NLP tasks), architecture (ConvNets, transformers) and bit-width (from int8 to
ternary quantization).
Comment: arXiv admin note: substantial text overlap with arXiv:2203.1464
Gradient-Based Post-Training Quantization: Challenging the Status Quo
Quantization has become a crucial step for the efficient deployment of deep
neural networks, where floating point operations are converted to simpler fixed
point operations. In its most naive form, it simply consists in a combination
of scaling and rounding transformations, leading to either a limited
compression rate or a significant accuracy drop. Recently, gradient-based
post-training quantization (GPTQ) methods appear to constitute a suitable
trade-off between such simple methods and more powerful, yet expensive
Quantization-Aware Training (QAT) approaches, particularly when attempting to
quantize LLMs, where scalability of the quantization process is of paramount
importance. GPTQ essentially consists in learning the rounding operation using
a small calibration set. In this work, we challenge common choices in GPTQ
methods. In particular, we show that the process is, to a certain extent,
robust to a number of variables (weight selection, feature augmentation, choice
of calibration set). More importantly, we derive a number of best practices for
designing more efficient and scalable GPTQ methods, regarding the problem
formulation (loss, degrees of freedom, use of non-uniform quantization schemes)
or optimization process (choice of variable and optimizer). Lastly, we propose
a novel importance-based mixed-precision technique. Those guidelines lead to
significant performance improvements on all the tested state-of-the-art GPTQ
methods and networks (e.g. +6.819 points on ViT for 4-bit quantization), paving
the way for the design of scalable yet effective quantization methods.
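The "naive form" of quantization mentioned in this abstract (a combination of scaling and rounding) can be illustrated with a minimal sketch. It assumes symmetric, per-tensor uniform quantization; the function names and the example weights are hypothetical and are not taken from the paper. It does not implement the learned rounding used by GPTQ methods.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization: scale, round, then clip (an assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def max_error(weights, bits):
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize(weights, bits)
    approx = dequantize(codes, scale)
    return max(abs(w - a) for w, a in zip(weights, approx))


if __name__ == "__main__":
    weights = [0.82, -0.31, 0.05, -0.97, 0.44, 0.12]   # illustrative values only
    for bits in (8, 4, 2):
        print(bits, "bits -> max error", max_error(weights, bits))
```

With this scheme the rounding error of each weight is at most half a quantization step, so the error grows as the bit width shrinks, which is the accuracy drop that more elaborate methods try to reduce.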
SAfER: Layer-Level Sensitivity Assessment for Efficient and Robust Neural Network Inference
Deep neural networks (DNNs) demonstrate outstanding performance across most
computer vision tasks. Some critical applications, such as autonomous driving
or medical imaging, also require investigation into their behavior and the
reasons behind the decisions they make. In this vein, DNN attribution consists
in studying the relationship between the predictions of a DNN and its inputs.
Attribution methods have been adapted to highlight the most relevant weights or
neurons in a DNN, allowing more efficient selection of which weights or neurons
can be pruned. However, a limitation of these approaches is that weights are
typically compared within each layer separately, while some layers might appear
as more critical than others. In this work, we propose to investigate DNN layer
importance, i.e. to estimate the sensitivity of the accuracy w.r.t.
perturbations applied at the layer level. To do so, we propose a novel dataset
to evaluate our method as well as future works. We benchmark a number of
criteria and draw conclusions regarding how to assess DNN layer importance and,
consequently, how to budget layers for increased DNN efficiency (with
applications for DNN pruning and quantization), as well as robustness to
hardware failure (e.g. bit swaps).
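One way to picture "sensitivity w.r.t. perturbations applied at the layer level" is the toy sketch below. It is an illustration under stated assumptions, not the criteria benchmarked in the paper: the network is a tiny hand-written MLP, the perturbation is Gaussian noise on one layer's weights at a time, and the mean squared change of the output is used as a stand-in for the change in accuracy. All names and values are hypothetical.

```python
import random


def forward(layers, x):
    """Tiny MLP: each layer is a weight matrix (list of rows), ReLU between layers."""
    for i, weights in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def layer_sensitivity(layers, inputs, noise_std, trials=20, seed=0):
    """Mean squared output change when Gaussian noise is added to one layer at a time."""
    rng = random.Random(seed)
    reference = [forward(layers, x) for x in inputs]
    scores = []
    for idx in range(len(layers)):
        total = 0.0
        for _ in range(trials):
            # Perturb only layer `idx`; every other layer is left untouched.
            noisy = [
                [[w + rng.gauss(0.0, noise_std) for w in row] for row in layer]
                if k == idx else layer
                for k, layer in enumerate(layers)
            ]
            for x, ref in zip(inputs, reference):
                out = forward(noisy, x)
                total += sum((o - r) ** 2 for o, r in zip(out, ref))
        scores.append(total / (trials * len(inputs)))
    return scores


if __name__ == "__main__":
    layers = [[[0.5, -0.2], [0.3, 0.8]], [[1.0, -1.0]]]   # illustrative weights
    inputs = [[1.0, 2.0], [0.5, -1.0]]
    print(layer_sensitivity(layers, inputs, noise_std=0.1))
```

A layer with a higher score would, under this proxy, be a candidate for a larger share of the bit-width or parameter budget.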
REx: Data-Free Residual Quantization Error Expansion
Deep neural networks (DNNs) are ubiquitous in computer vision and natural
language processing, but suffer from high inference cost. This problem can be
addressed by quantization, which consists in converting floating point
operations into a lower bit-width format. With the growing concerns on privacy
rights, we focus our efforts on data-free methods. However, such techniques
suffer from their lack of adaptability to the target devices, as hardware
typically supports only specific bit widths. Thus, to adapt to a variety of
devices, a quantization method shall be flexible enough to find good accuracy
vs. speed trade-offs for every bit width and target device. To achieve this,
we propose REx, a quantization method that leverages residual error expansion,
along with group sparsity and an ensemble approximation for better
parallelization. REx is backed by strong theoretical guarantees and
achieves superior performance on every benchmarked application (from vision to
NLP tasks), architecture (ConvNets, transformers) and bit-width (from int8 to
ternary quantization).
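The idea of expanding the residual quantization error can be sketched as follows. This is one plausible reading of the abstract and an assumption on our part: weights are quantized, the remaining error is quantized again, and the low-bit terms are summed. The group sparsity and ensemble approximation mentioned above are not modelled, and all names and values are hypothetical.

```python
def quantize(values, bits):
    """Symmetric uniform quantization, returned directly in dequantized form."""
    qmax = 2 ** (bits - 1) - 1                  # 1 for ternary (2 bits), 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def residual_expansion(weights, bits, order):
    """Quantize the weights, then repeatedly quantize what is left over."""
    terms = []
    residual = list(weights)
    for _ in range(order):
        term = quantize(residual, bits)
        terms.append(term)
        residual = [r - t for r, t in zip(residual, term)]
    return terms


def reconstruct(terms):
    """Sum the expansion terms element-wise to approximate the original weights."""
    return [sum(column) for column in zip(*terms)]


def max_error(weights, terms):
    """Largest absolute gap between the original weights and their reconstruction."""
    return max(abs(w - a) for w, a in zip(weights, reconstruct(terms)))


if __name__ == "__main__":
    weights = [0.82, -0.31, 0.05, -0.97, 0.44, 0.12]   # illustrative values only
    for order in (1, 2, 3):
        terms = residual_expansion(weights, bits=2, order=order)
        print("order", order, "-> max error", max_error(weights, terms))
```

Each additional term costs one more low-bit operation but shrinks the approximation error, which is one way a method could trade accuracy against speed for a given bit width.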
Archtree: on-the-fly tree-structured exploration for latency-aware pruning of deep neural networks
Deep neural networks (DNNs) have become ubiquitous in addressing a number of
problems, particularly in computer vision. However, DNN inference is
computationally intensive, which can be prohibitive e.g. when considering edge
devices. To solve this problem, a popular solution is DNN pruning, and in particular
structured pruning, where coherent computational blocks (e.g. channels for
convolutional networks) are removed: as an exhaustive search of the space of
pruned sub-models is intractable in practice, channels are typically removed
iteratively based on an importance estimation heuristic. Recently, promising
latency-aware pruning methods were proposed, where channels are removed until
the network reaches a target budget of wall-clock latency pre-emptively
estimated on specific hardware. In this paper, we present Archtree, a novel
method for latency-driven structured pruning of DNNs. Archtree explores
multiple candidate pruned sub-models in parallel in a tree-like fashion,
allowing for a better exploration of the search space. Furthermore, it involves
on-the-fly latency estimation on the target hardware, yielding latencies
closer to the specified budget. Empirical results on several DNN
architectures and target hardware show that Archtree better preserves the
original model accuracy while better fitting the latency budget as compared to
existing state-of-the-art methods.
Comment: 10 pages, 7 figures
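The tree-like exploration of candidate pruned sub-models under a latency budget can be pictured with a small beam-style search. This is a loose sketch under assumptions, not the Archtree algorithm itself: a sub-model is reduced to a tuple of channel counts, `importance` is a hypothetical per-layer cost of removing one channel, and `latency_fn` is a stand-in for a measurement that would be taken on the target hardware.

```python
def explore(channels, importance, latency_fn, budget, width=2):
    """Keep `width` candidate sub-models per depth until one fits the latency budget."""
    # A candidate is (total importance removed so far, channels per layer).
    frontier = [(0.0, tuple(channels))]
    while True:
        fitting = [c for c in frontier if latency_fn(c[1]) <= budget]
        if fitting:
            return min(fitting)[1]              # candidate that removed the least importance
        children = {}
        for cost, config in frontier:
            for layer, count in enumerate(config):
                if count > 1:                   # never remove the last channel of a layer
                    child = config[:layer] + (count - 1,) + config[layer + 1:]
                    child_cost = cost + importance[layer]
                    if child not in children or child_cost < children[child]:
                        children[child] = child_cost
        if not children:
            raise ValueError("latency budget cannot be reached")
        ranked = sorted((cost, config) for config, cost in children.items())
        frontier = ranked[:width]


def toy_latency(config):
    """Hypothetical latency model: a fixed cost per channel in each layer."""
    per_channel = (1.0, 0.5, 2.0)
    return sum(n * t for n, t in zip(config, per_channel))


if __name__ == "__main__":
    pruned = explore([8, 16, 4], importance=[0.5, 0.1, 1.0],
                     latency_fn=toy_latency, budget=20.0)
    print(pruned, toy_latency(pruned))
```

Keeping several candidates alive at each depth lets the search reach a configuration that fits the budget even when the greedy choice alone would overshoot or take longer to get there; replacing `toy_latency` with a real timing call is what would make the estimate "on the fly".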