11 research outputs found
Binary and Ternary Natural Language Generation
Ternary and binary neural networks enable multiplication-free computation and
promise multiple orders of magnitude efficiency gains over full-precision
networks if implemented on specialized hardware. However, since both the
parameter and the output space are highly discretized, such networks have
proven very difficult to optimize. The difficulties are compounded for the
class of transformer text generation models due to the sensitivity of the
attention operation to quantization and the noise-compounding effects of
autoregressive decoding in the high-cardinality output space. We approach the
problem with a mix of statistics-based quantization for the weights and elastic
quantization of the activations and demonstrate the first ternary and binary
transformer models on the downstream tasks of summarization and machine
translation. Our ternary BART base achieves an R1 score of 41 on the
CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while
being 16x more efficient. Our binary model, while less accurate, achieves a
highly non-trivial score of 35.6. For machine translation, we achieve BLEU
scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full
precision mBART model score of 26.8. We also compare our approach in the 8-bit
activation setting, where our ternary and even binary weight models can match
or outperform the best existing 8-bit weight models in the literature. Our code
and models are available at:
https://github.com/facebookresearch/Ternary_Binary_Transformer
Comment: ACL 2023 Oral
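As a rough illustration of the two ingredients named above, the sketch below pairs
a statistics-based ternary weight quantizer (threshold and per-tensor scale derived
from weight statistics, in the spirit of TWN-style schemes) with an activation
quantizer that has a learnable scale. The 0.7 threshold factor, the class names,
and the straight-through-estimator details are illustrative assumptions, not the
paper's exact formulation.

```python
import torch

def ternarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Statistics-based ternary quantization (TWN-style sketch): threshold and
    per-tensor scale are derived from the weight statistics."""
    delta = 0.7 * w.abs().mean()                                # assumed threshold factor
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)  # per-tensor scale
    return alpha * torch.sign(w) * mask                         # values in {-alpha, 0, +alpha}


class ElasticActQuant(torch.nn.Module):
    """Activation quantizer with a learnable scale; rounding uses a
    straight-through estimator so the scale keeps receiving gradients."""
    def __init__(self, bits: int = 2):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.tensor(1.0))
        self.qmax = 2 ** (bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_scaled = x / self.scale
        q = torch.clamp(x_scaled + (torch.round(x_scaled) - x_scaled).detach(),
                        -self.qmax, self.qmax)
        return q * self.scale
```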
Learning a Dual-Mode Speech Recognition Model via Self-Pruning
There is growing interest in unifying the streaming and full-context
automatic speech recognition (ASR) networks into a single end-to-end ASR model
to simplify model training and deployment for both use cases. In real-world
ASR applications, however, streaming ASR models typically operate under tighter
storage and computational constraints - e.g., on embedded devices - than
server-side full-context models. Motivated by recent progress in
Omni-sparsity supernet training, where multiple subnetworks are jointly
optimized in one single model, this work aims to jointly learn a compact sparse
on-device streaming ASR model, and a large dense server non-streaming model, in
a single supernet. We further show that performing supernet training during
both wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning not
only substantially improves the large non-streaming model, as shown in prior
work, but also improves the compact sparse streaming model.
Comment: 7 pages, 1 figure. Accepted for publication at IEEE Spoken Language
Technology Workshop (SLT), 202
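A minimal sketch of the weight-sharing idea: one parameter tensor serves both a
dense full-context path and a sparse streaming path obtained by magnitude pruning,
and both modes are optimized jointly. The layer, the pruning rule, and the joint
loss below are assumptions for illustration, not the paper's Omni-sparsity
training recipe.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the largest-magnitude entries of `weight`; zero out the rest."""
    k = max(1, int(weight.numel() * (1.0 - sparsity)))          # weights to keep
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()


class DualModeLinear(torch.nn.Module):
    """One weight matrix serving two modes: the dense full-context (server) path
    uses the raw weights; the sparse streaming (on-device) path applies a
    magnitude-pruning mask, so both subnetworks share parameters."""
    def __init__(self, d_in: int, d_out: int, sparsity: float = 0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor, streaming: bool) -> torch.Tensor:
        w = self.weight
        if streaming:
            w = w * magnitude_mask(w.detach(), self.sparsity)   # sparse subnetwork
        return x @ w.t()

# Joint training step (sketch): optimize both modes on the same batch, e.g.
#   loss = loss_fn(model(x, streaming=False), y) + loss_fn(model(x, streaming=True), y)
```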
PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion
Fusing camera data with LiDAR is a promising technique for improving the
accuracy of 3D detection, owing to the complementary physical properties of the
two sensors. Most existing methods fuse camera features directly with raw LiDAR
point clouds or shallow 3D features; however, direct fusion with deep 3D
features has been observed to achieve inferior accuracy due to feature
misalignment. The misalignment that originates
from the feature aggregation across large receptive fields becomes increasingly
severe for deep network stages. In this paper, we propose PathFusion to enable
path-consistent LiDAR-camera deep feature fusion. PathFusion introduces a path
consistency loss between shallow and deep features, which encourages the 2D
backbone and its fusion path to transform 2D features in a way that is
semantically aligned with the transform of the 3D backbone. We apply PathFusion
to the prior-art fusion baseline, Focals Conv, and observe mAP improvements of
more than 1.2% on the nuScenes test split, consistently with and without
test-time augmentation. Moreover, PathFusion also improves KITTI AP3D (R11) by
more than 0.6% on the moderate difficulty level.
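The abstract leaves the exact form of the path consistency loss unspecified; below
is one plausible shape of such an objective, assuming the deep LiDAR features have
already been projected into the image plane and that a cosine distance measures
semantic alignment. Function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def path_consistency_loss(deep_2d_feats: torch.Tensor,
                          deep_3d_feats_proj: torch.Tensor) -> torch.Tensor:
    """One plausible path-consistency objective: after projecting the deep 3D
    (LiDAR) features into the image plane, penalize misalignment with the deep
    2D (camera) features so that the 2D backbone and its fusion path stay
    semantically consistent with the 3D backbone's transform. Assumes both
    inputs flatten to the same per-sample feature dimension."""
    cam = F.normalize(deep_2d_feats.flatten(1), dim=1)
    lidar = F.normalize(deep_3d_feats_proj.flatten(1), dim=1)
    return (1.0 - (cam * lidar).sum(dim=1)).mean()
```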
Gen2Det: Generate to Detect
Recently, diffusion models have shown improvements in synthetic image quality
as well as better control over generation. We motivate and present Gen2Det, a
simple modular pipeline to create synthetic training data for object detection
for free by leveraging state-of-the-art grounded image generation methods.
Unlike existing works, which generate individual object instances and then
require identifying the foreground and pasting it onto other images, we
simplify the pipeline to directly generating scene-centric images. In addition
to the synthetic data, Gen2Det also provides a suite of techniques to best
utilize the generated data, including image-level filtering, instance-level
filtering, and a better training recipe to account for imperfections in the
generation. Using Gen2Det, we show
healthy improvements on object detection and segmentation tasks under various
settings and agnostic to detection methods. In the long-tailed detection
setting on LVIS, Gen2Det improves the performance on rare categories by a large
margin while also significantly improving the performance on other categories,
e.g. we see an improvement of 2.13 Box AP and 1.84 Mask AP over just training
on real data on LVIS with Mask R-CNN. In the low-data regime setting on COCO,
Gen2Det consistently improves both Box and Mask AP by 2.27 and 1.85 points. In
the most general detection setting, Gen2Det still demonstrates robust
performance gains, e.g., it improves the Box and Mask AP on COCO by 0.45 and
0.32 points, respectively.
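As a hedged sketch of what image-level and instance-level filtering of generated
data could look like, the snippet below drops whole synthetic images that an
off-the-shelf detector scores as implausible and then removes low-confidence
synthetic instances. The detector interface, score fields, and thresholds are
placeholders, not Gen2Det's actual pipeline.

```python
from typing import Callable, Dict, List

def filter_synthetic_samples(samples: List[Dict],
                             detector: Callable[[object], List[Dict]],
                             image_thresh: float = 0.3,
                             instance_thresh: float = 0.5) -> List[Dict]:
    """Illustrative image- and instance-level filtering of generated detection
    data. `detector` is an assumed off-the-shelf scorer returning a list of
    {"score": float} detections; the thresholds are placeholders."""
    kept = []
    for sample in samples:               # sample: {"image": ..., "instances": [...]}
        detections = detector(sample["image"])
        # Image-level filtering: drop images the detector finds implausible overall.
        if not detections or max(d["score"] for d in detections) < image_thresh:
            continue
        # Instance-level filtering: drop low-confidence synthetic boxes.
        instances = [inst for inst in sample["instances"]
                     if inst.get("score", 1.0) >= instance_thresh]
        if instances:
            kept.append({"image": sample["image"], "instances": instances})
    return kept
```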
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Several post-training quantization methods have been applied to large
language models (LLMs), and have been shown to perform well down to 8-bits. We
find that these methods break down at lower bit precision, and investigate
quantization aware training for LLMs (LLM-QAT) to push quantization levels even
further. We propose a data-free distillation method that leverages generations
produced by the pre-trained model, which better preserves the original output
distribution and allows quantizing any generative model independent of its
training data, similar to post-training quantization methods. In addition to
quantizing weights and activations, we also quantize the KV cache, which is
critical for increasing throughput and supporting long sequence dependencies at
current model sizes. We experiment with LLaMA models of sizes 7B, 13B, and 30B,
at quantization levels down to 4-bits. We observe large improvements over
training-free methods, especially in the low-bit settings.
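A minimal sketch of the quantization side, assuming symmetric fake quantization
applied to keys and values before they enter the KV cache; the bit-width, the
granularity, and the commented data-free distillation loop (with illustrative
method names) are assumptions rather than the paper's exact recipe.

```python
import torch

def fake_quantize_kv(t: torch.Tensor, bits: int = 4, dim: int = -1) -> torch.Tensor:
    """Symmetric fake quantization as it might be applied to keys/values in the
    KV cache during quantization-aware training; the per-channel granularity
    and 4-bit setting here are assumptions."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)
    return q * scale

# Data-free distillation loop (sketch, illustrative names): the full-precision
# teacher generates its own training sequences, and the quantized student is
# trained to match the teacher's next-token distribution on them, e.g.
#   seqs          = teacher.generate(bos_tokens, do_sample=True, max_new_tokens=N)
#   teacher_probs = teacher(seqs).logits.softmax(-1)
#   loss          = F.kl_div(student(seqs).logits.log_softmax(-1), teacher_probs,
#                            reduction="batchmean")
```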
LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting
This paper proposes a hardware-efficient architecture, Linearized Convolution
Network (LiCo-Net), for keyword spotting. It is optimized specifically for
low-power processor units like microcontrollers. ML operators exhibit
heterogeneous efficiency profiles on power-efficient hardware. For the same
theoretical computation cost, int8 operators are more computationally effective
than float operators, and linear layers are often more efficient than other
layers. The proposed LiCo-Net is a dual-phase system that uses the efficient
int8 linear operators at the inference phase and applies streaming convolutions
at the training phase to maintain a high model capacity. The experimental
results show that LiCo-Net outperforms single-value decomposition filter (SVDF)
on hardware efficiency with on-par detection performance. Compared to SVDF,
LiCo-Net reduces cycles by 40% on the HiFi4 DSP.
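To make the "linearized convolution" idea concrete: a streaming temporal
convolution with a fixed kernel size can be executed at inference as a single
matrix-vector product over a buffer of recent frames, which is exactly the kind
of linear operator that runs efficiently in int8 on a DSP. The sketch below shows
the reshaping in floating point; the actual int8 kernels and the paper's
architecture are not reproduced.

```python
import numpy as np

def streaming_conv_as_linear(conv_weight: np.ndarray,
                             frame_buffer: np.ndarray) -> np.ndarray:
    """Sketch of linearizing a streaming temporal convolution: for a fixed
    kernel size K, the output for the current frame is a matrix-vector product
    over the flattened buffer of the last K frames."""
    out_ch, in_ch, K = conv_weight.shape           # conv_weight: (out, in, kernel)
    w_linear = conv_weight.reshape(out_ch, in_ch * K)
    x = frame_buffer.T.reshape(in_ch * K)          # frame_buffer: (K, in_ch)
    return w_linear @ x                            # one output frame, shape (out_ch,)
```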
TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models
Automatic Speech Recognition (ASR) models need to be optimized for specific
hardware before they can be deployed on devices. This can be done by tuning the
model's hyperparameters or exploring variations in its architecture.
Re-training and re-validating models after making these changes can be a
resource-intensive task. This paper presents TODM (Train Once Deploy Many), a
new approach to efficiently train many sizes of hardware-friendly on-device ASR
models with GPU-hours comparable to those of a single training job. TODM
leverages insights from prior work on Supernet, where Recurrent Neural Network
Transducer (RNN-T) models share weights within a Supernet. It reduces layer
sizes and widths of the Supernet to obtain subnetworks, making them smaller
models suitable for all hardware types. We introduce a novel combination of
three techniques to improve the outcomes of the TODM Supernet: adaptive
dropouts, an in-place Alpha-divergence knowledge distillation, and the use of
the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained
versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T using
LibriSpeech. Results demonstrate that our TODM Supernet either matches or
surpasses the performance of manually tuned models, with up to a 3% relative
improvement in word error rate (WER), while efficiently keeping the cost of
training many models at a small constant.
Comment: Meta AI; Submitted to ICASSP 202
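A sketch of the in-place alpha-divergence distillation ingredient, assuming the
full supernet's output distribution serves as the teacher for a sampled
subnetwork within the same training step. The alpha value and the way the loss
is combined with the ASR objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def alpha_divergence(teacher_logits: torch.Tensor,
                     student_logits: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Alpha-divergence D_alpha(p || q) between teacher and student output
    distributions, usable as an in-place distillation loss from the full
    supernet (teacher) to a sampled subnetwork (student). alpha = 0.5 is an
    illustrative choice; the limit alpha -> 1 recovers KL(p || q)."""
    p = F.softmax(teacher_logits, dim=-1).clamp(min=1e-8)
    q = F.softmax(student_logits, dim=-1).clamp(min=1e-8)
    if abs(alpha - 1.0) < 1e-6:
        return (p * (p / q).log()).sum(-1).mean()           # KL(p || q)
    integral = (p.pow(alpha) * q.pow(1.0 - alpha)).sum(-1)
    return ((1.0 - integral) / (alpha * (1.0 - alpha))).mean()

# In-place KD (sketch): teacher logits come from the full supernet's forward
# pass in the same training step, with gradients blocked on the teacher side:
#   loss = asr_loss + alpha_divergence(supernet_logits.detach(), subnet_logits)
```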
BiT: Robustly Binarized Multi-distilled Transformer
Modern pre-trained transformers have rapidly advanced the state-of-the-art in
machine learning, but have also grown in parameters and computational
complexity, making them increasingly difficult to deploy in
resource-constrained environments. Binarizing the weights and activations of
the network can significantly alleviate these issues; however, it is technically
challenging from an optimization perspective. In this work, we identify a
series of improvements that enable binary transformers to reach much higher
accuracy than was previously possible. These include a two-set
binarization scheme, a novel elastic binary activation function with learned
parameters, and a method to quantize a network to its limit by successively
distilling higher precision models into lower precision students. These
approaches allow, for the first time, fully binarized transformer models that
are at a practical level of accuracy, approaching a full-precision BERT
baseline on the GLUE language understanding benchmark to within as little as
5.9%.
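As a hedged sketch of the "elastic binary activation function with learned
parameters", the module below binarizes a threshold-shifted input and rescales
it with a learnable scale, using a straight-through estimator; the
parameterization and the staged multi-distillation schedule in the trailing
comment are assumptions based on the abstract, not the paper's exact definitions.

```python
import torch

class ElasticBinaryActivation(torch.nn.Module):
    """Sketch of an elastic binary activation: a learnable scale and threshold
    shift where and how the input is binarized, with a straight-through
    estimator for gradients."""
    def __init__(self):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.tensor(1.0))
        self.threshold = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shifted = x - self.threshold
        sign = torch.where(shifted >= 0, torch.ones_like(shifted),
                           -torch.ones_like(shifted))
        # Straight-through estimator on the sign; the learnable scale stays in
        # the differentiable path so it receives gradients.
        sign_ste = shifted + (sign - shifted).detach()
        return self.scale * sign_ste                 # values in {-scale, +scale}

# Multi-distillation (sketch): quantize in stages, distilling each higher-
# precision model into the next lower-precision student, e.g.
#   full-precision teacher -> intermediate-precision student -> fully binary model.
```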