11 research outputs found
Binary and Ternary Natural Language Generation
Ternary and binary neural networks enable multiplication-free computation and
promise multiple orders of magnitude efficiency gains over full-precision
networks if implemented on specialized hardware. However, since both the
parameter and the output space are highly discretized, such networks have
proven very difficult to optimize. The difficulties are compounded for the
class of transformer text generation models due to the sensitivity of the
attention operation to quantization and the noise-compounding effects of
autoregressive decoding in the high-cardinality output space. We approach the
problem with a mix of statistics-based quantization for the weights and elastic
quantization of the activations and demonstrate the first ternary and binary
transformer models on the downstream tasks of summarization and machine
translation. Our ternary BART base achieves an R1 score of 41 on the
CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while
being 16x more efficient. Our binary model, while less accurate, achieves a
highly non-trivial score of 35.6. For machine translation, we achieve BLEU
scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full
precision mBART model score of 26.8. We also compare our approach in the 8-bit
activation setting, where our ternary and even binary weight models can match
or outperform the best existing 8-bit weight models in the literature. Our code
and models are available at:
https://github.com/facebookresearch/Ternary_Binary_Transformer
Comment: ACL 2023 Oral
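As a rough illustration of the two ingredients named above, the sketch below pairs
a statistics-based ternary weight quantizer (threshold and per-tensor scale derived
from weight statistics, in the spirit of TWN-style schemes) with an activation
quantizer that has a learnable scale. The 0.7 threshold factor, the class names,
and the straight-through-estimator details are illustrative assumptions, not the
paper's exact formulation.

```python
import torch

def ternarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Statistics-based ternary quantization (TWN-style sketch): threshold and
    per-tensor scale are derived from the weight statistics."""
    delta = 0.7 * w.abs().mean()                                # assumed threshold factor
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)  # per-tensor scale
    return alpha * torch.sign(w) * mask                         # values in {-alpha, 0, +alpha}


class ElasticActQuant(torch.nn.Module):
    """Activation quantizer with a learnable scale; rounding uses a
    straight-through estimator so the scale keeps receiving gradients."""
    def __init__(self, bits: int = 2):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.tensor(1.0))
        self.qmax = 2 ** (bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_scaled = x / self.scale
        q = torch.clamp(x_scaled + (torch.round(x_scaled) - x_scaled).detach(),
                        -self.qmax, self.qmax)
        return q * self.scale
```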
Learning a Dual-Mode Speech Recognition Model via Self-Pruning
There is growing interest in unifying the streaming and full-context
automatic speech recognition (ASR) networks into a single end-to-end ASR model
to simplify model training and deployment for both use cases. In real-world
ASR applications, however, streaming ASR models typically operate under tighter
storage and computational constraints - e.g., on embedded devices - than
server-side full-context models. Motivated by recent progress in
Omni-sparsity supernet training, where multiple subnetworks are jointly
optimized in one single model, this work aims to jointly learn a compact sparse
on-device streaming ASR model, and a large dense server non-streaming model, in
a single supernet. We further show that performing supernet training during
both wav2vec 2.0 self-supervised learning and supervised ASR fine-tuning not
only substantially improves the large non-streaming model, as shown in prior
work, but also improves the compact sparse streaming model.
Comment: 7 pages, 1 figure. Accepted for publication at IEEE Spoken Language
Technology Workshop (SLT), 202
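A minimal sketch of the weight-sharing idea: one parameter tensor serves both a
dense full-context path and a sparse streaming path obtained by magnitude pruning,
and both modes are optimized jointly. The layer, the pruning rule, and the joint
loss below are assumptions for illustration, not the paper's Omni-sparsity
training recipe.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the largest-magnitude entries of `weight`; zero out the rest."""
    k = max(1, int(weight.numel() * (1.0 - sparsity)))          # weights to keep
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()


class DualModeLinear(torch.nn.Module):
    """One weight matrix serving two modes: the dense full-context (server) path
    uses the raw weights; the sparse streaming (on-device) path applies a
    magnitude-pruning mask, so both subnetworks share parameters."""
    def __init__(self, d_in: int, d_out: int, sparsity: float = 0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor, streaming: bool) -> torch.Tensor:
        w = self.weight
        if streaming:
            w = w * magnitude_mask(w.detach(), self.sparsity)   # sparse subnetwork
        return x @ w.t()

# Joint training step (sketch): optimize both modes on the same batch, e.g.
#   loss = loss_fn(model(x, streaming=False), y) + loss_fn(model(x, streaming=True), y)
```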
PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion
Fusing camera data with LiDAR is a promising technique for improving the
accuracy of 3D detection, owing to the complementary physical properties of the
two sensors. Most existing methods fuse camera features directly with raw LiDAR
point clouds or shallow 3D features; however, direct fusion with deep 3D
features has been observed to achieve inferior accuracy due to feature
misalignment. The misalignment that originates
from the feature aggregation across large receptive fields becomes increasingly
severe for deep network stages. In this paper, we propose PathFusion to enable
path-consistent LiDAR-camera deep feature fusion. PathFusion introduces a path
consistency loss between shallow and deep features, which encourages the 2D
backbone and its fusion path to transform 2D features in a way that is
semantically aligned with the transform of the 3D backbone. We apply PathFusion
to the prior-art fusion baseline, Focals Conv, and observe mAP improvements of
more than 1.2% on the nuScenes test split, consistently with and without
test-time augmentation. Moreover, PathFusion also improves KITTI AP3D (R11) by
more than 0.6% on the moderate difficulty level.
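The abstract leaves the exact form of the path consistency loss unspecified; below
is one plausible shape of such an objective, assuming the deep LiDAR features have
already been projected into the image plane and that a cosine distance measures
semantic alignment. Function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def path_consistency_loss(deep_2d_feats: torch.Tensor,
                          deep_3d_feats_proj: torch.Tensor) -> torch.Tensor:
    """One plausible path-consistency objective: after projecting the deep 3D
    (LiDAR) features into the image plane, penalize misalignment with the deep
    2D (camera) features so that the 2D backbone and its fusion path stay
    semantically consistent with the 3D backbone's transform. Assumes both
    inputs flatten to the same per-sample feature dimension."""
    cam = F.normalize(deep_2d_feats.flatten(1), dim=1)
    lidar = F.normalize(deep_3d_feats_proj.flatten(1), dim=1)
    return (1.0 - (cam * lidar).sum(dim=1)).mean()
```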
Gen2Det: Generate to Detect
Recently, diffusion models have shown improvements in synthetic image quality
as well as better control over generation. We motivate and present Gen2Det, a
simple modular pipeline to create synthetic training data for object detection
for free by leveraging state-of-the-art grounded image generation methods.
Unlike existing works, which generate individual object instances and then
require identifying the foreground and pasting it onto other images, we
simplify the pipeline to directly generating scene-centric images. In addition
to the synthetic data, Gen2Det also provides a suite of techniques to best
utilize the generated data, including image-level filtering, instance-level
filtering, and a better training recipe to account for imperfections in the
generation. Using Gen2Det, we show
healthy improvements on object detection and segmentation tasks under various
settings and agnostic to detection methods. In the long-tailed detection
setting on LVIS, Gen2Det improves the performance on rare categories by a large
margin while also significantly improving the performance on other categories,
e.g. we see an improvement of 2.13 Box AP and 1.84 Mask AP over just training
on real data on LVIS with Mask R-CNN. In the low-data regime setting on COCO,
Gen2Det consistently improves both Box and Mask AP by 2.27 and 1.85 points. In
the most general detection setting, Gen2Det still demonstrates robust
performance gains, e.g., it improves the Box and Mask AP on COCO by 0.45 and
0.32 points, respectively.
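As a hedged sketch of what image-level and instance-level filtering of generated
data could look like, the snippet below drops whole synthetic images that an
off-the-shelf detector scores as implausible and then removes low-confidence
synthetic instances. The detector interface, score fields, and thresholds are
placeholders, not Gen2Det's actual pipeline.

```python
from typing import Callable, Dict, List

def filter_synthetic_samples(samples: List[Dict],
                             detector: Callable[[object], List[Dict]],
                             image_thresh: float = 0.3,
                             instance_thresh: float = 0.5) -> List[Dict]:
    """Illustrative image- and instance-level filtering of generated detection
    data. `detector` is an assumed off-the-shelf scorer returning a list of
    {"score": float} detections; the thresholds are placeholders."""
    kept = []
    for sample in samples:               # sample: {"image": ..., "instances": [...]}
        detections = detector(sample["image"])
        # Image-level filtering: drop images the detector finds implausible overall.
        if not detections or max(d["score"] for d in detections) < image_thresh:
            continue
        # Instance-level filtering: drop low-confidence synthetic boxes.
        instances = [inst for inst in sample["instances"]
                     if inst.get("score", 1.0) >= instance_thresh]
        if instances:
            kept.append({"image": sample["image"], "instances": instances})
    return kept
```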
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Several post-training quantization methods have been applied to large
language models (LLMs), and have been shown to perform well down to 8-bits. We
find that these methods break down at lower bit precision, and investigate
quantization aware training for LLMs (LLM-QAT) to push quantization levels even
further. We propose a data-free distillation method that leverages generations
produced by the pre-trained model, which better preserves the original output
distribution and allows quantizing any generative model independent of its
training data, similar to post-training quantization methods. In addition to
quantizing weights and activations, we also quantize the KV cache, which is
critical for increasing throughput and supporting long sequence dependencies at
current model sizes. We experiment with LLaMA models of sizes 7B, 13B, and 30B,
at quantization levels down to 4-bits. We observe large improvements over
training-free methods, especially in the low-bit settings.
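A minimal sketch of the quantization side, assuming symmetric fake quantization
applied to keys and values before they enter the KV cache; the bit-width, the
granularity, and the commented data-free distillation loop (with illustrative
method names) are assumptions rather than the paper's exact recipe.

```python
import torch

def fake_quantize_kv(t: torch.Tensor, bits: int = 4, dim: int = -1) -> torch.Tensor:
    """Symmetric fake quantization as it might be applied to keys/values in the
    KV cache during quantization-aware training; the per-channel granularity
    and 4-bit setting here are assumptions."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)
    return q * scale

# Data-free distillation loop (sketch, illustrative names): the full-precision
# teacher generates its own training sequences, and the quantized student is
# trained to match the teacher's next-token distribution on them, e.g.
#   seqs          = teacher.generate(bos_tokens, do_sample=True, max_new_tokens=N)
#   teacher_probs = teacher(seqs).logits.softmax(-1)
#   loss          = F.kl_div(student(seqs).logits.log_softmax(-1), teacher_probs,
#                            reduction="batchmean")
```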
LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting
This paper proposes a hardware-efficient architecture, Linearized Convolution
Network (LiCo-Net), for keyword spotting. It is optimized specifically for
low-power processor units like microcontrollers. ML operators exhibit
heterogeneous efficiency profiles on power-efficient hardware. For the same
theoretical computation cost, int8 operators are more computationally effective
than float operators, and linear layers are often more efficient than other
layers. The proposed LiCo-Net is a dual-phase system that uses the efficient
int8 linear operators at the inference phase and applies streaming convolutions
at the training phase to maintain a high model capacity. The experimental
results show that LiCo-Net outperforms single-value decomposition filter (SVDF)
on hardware efficiency with on-par detection performance. Compared to SVDF,
LiCo-Net reduces cycles by 40% on the HiFi4 DSP.
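To make the "linearized convolution" idea concrete: a streaming temporal
convolution with a fixed kernel size can be executed at inference as a single
matrix-vector product over a buffer of recent frames, which is exactly the kind
of linear operator that runs efficiently in int8 on a DSP. The sketch below shows
the reshaping in floating point; the actual int8 kernels and the paper's
architecture are not reproduced.

```python
import numpy as np

def streaming_conv_as_linear(conv_weight: np.ndarray,
                             frame_buffer: np.ndarray) -> np.ndarray:
    """Sketch of linearizing a streaming temporal convolution: for a fixed
    kernel size K, the output for the current frame is a matrix-vector product
    over the flattened buffer of the last K frames."""
    out_ch, in_ch, K = conv_weight.shape           # conv_weight: (out, in, kernel)
    w_linear = conv_weight.reshape(out_ch, in_ch * K)
    x = frame_buffer.T.reshape(in_ch * K)          # frame_buffer: (K, in_ch)
    return w_linear @ x                            # one output frame, shape (out_ch,)
```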
TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models
Automatic Speech Recognition (ASR) models need to be optimized for specific
hardware before they can be deployed on devices. This can be done by tuning the
model's hyperparameters or exploring variations in its architecture.
Re-training and re-validating models after making these changes can be a
resource-intensive task. This paper presents TODM (Train Once Deploy Many), a
new approach to efficiently train many sizes of hardware-friendly on-device ASR
models with GPU-hours comparable to those of a single training job. TODM
leverages insights from prior work on Supernet, where Recurrent Neural Network
Transducer (RNN-T) models share weights within a Supernet. It reduces layer
sizes and widths of the Supernet to obtain subnetworks, making them smaller
models suitable for all hardware types. We introduce a novel combination of
three techniques to improve the outcomes of the TODM Supernet: adaptive
dropouts, an in-place Alpha-divergence knowledge distillation, and the use of
the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained
versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T using
LibriSpeech. Results demonstrate that our TODM Supernet either matches or
surpasses the performance of manually tuned models, with up to a 3% relative
improvement in word error rate (WER), while efficiently keeping the cost of
training many models at a small constant.
Comment: Meta AI; Submitted to ICASSP 202
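A sketch of the in-place alpha-divergence distillation ingredient, assuming the
full supernet's output distribution serves as the teacher for a sampled
subnetwork within the same training step. The alpha value and the way the loss
is combined with the ASR objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def alpha_divergence(teacher_logits: torch.Tensor,
                     student_logits: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Alpha-divergence D_alpha(p || q) between teacher and student output
    distributions, usable as an in-place distillation loss from the full
    supernet (teacher) to a sampled subnetwork (student). alpha = 0.5 is an
    illustrative choice; the limit alpha -> 1 recovers KL(p || q)."""
    p = F.softmax(teacher_logits, dim=-1).clamp(min=1e-8)
    q = F.softmax(student_logits, dim=-1).clamp(min=1e-8)
    if abs(alpha - 1.0) < 1e-6:
        return (p * (p / q).log()).sum(-1).mean()           # KL(p || q)
    integral = (p.pow(alpha) * q.pow(1.0 - alpha)).sum(-1)
    return ((1.0 - integral) / (alpha * (1.0 - alpha))).mean()

# In-place KD (sketch): teacher logits come from the full supernet's forward
# pass in the same training step, with gradients blocked on the teacher side:
#   loss = asr_loss + alpha_divergence(supernet_logits.detach(), subnet_logits)
```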
BiT: Robustly Binarized Multi-distilled Transformer
Modern pre-trained transformers have rapidly advanced the state-of-the-art in
machine learning, but have also grown in parameters and computational
complexity, making them increasingly difficult to deploy in
resource-constrained environments. Binarizing the weights and activations of
the network can significantly alleviate these issues; however, it is technically
challenging from an optimization perspective. In this work, we identify a
series of improvements that enable binary transformers to reach much higher
accuracy than was previously possible. These include a two-set
binarization scheme, a novel elastic binary activation function with learned
parameters, and a method to quantize a network to its limit by successively
distilling higher precision models into lower precision students. These
approaches allow, for the first time, fully binarized transformer models that
are at a practical level of accuracy, approaching a full-precision BERT
baseline on the GLUE language understanding benchmark to within as little as
5.9%.
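As a hedged sketch of the "elastic binary activation function with learned
parameters", the module below binarizes a threshold-shifted input and rescales
it with a learnable scale, using a straight-through estimator; the
parameterization and the staged multi-distillation schedule in the trailing
comment are assumptions based on the abstract, not the paper's exact definitions.

```python
import torch

class ElasticBinaryActivation(torch.nn.Module):
    """Sketch of an elastic binary activation: a learnable scale and threshold
    shift where and how the input is binarized, with a straight-through
    estimator for gradients."""
    def __init__(self):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.tensor(1.0))
        self.threshold = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shifted = x - self.threshold
        sign = torch.where(shifted >= 0, torch.ones_like(shifted),
                           -torch.ones_like(shifted))
        # Straight-through estimator on the sign; the learnable scale stays in
        # the differentiable path so it receives gradients.
        sign_ste = shifted + (sign - shifted).detach()
        return self.scale * sign_ste                 # values in {-scale, +scale}

# Multi-distillation (sketch): quantize in stages, distilling each higher-
# precision model into the next lower-precision student, e.g.
#   full-precision teacher -> intermediate-precision student -> fully binary model.
```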