Binary and Ternary Natural Language Generation
Ternary and binary neural networks enable multiplication-free computation and
promise multiple orders of magnitude efficiency gains over full-precision
networks if implemented on specialized hardware. However, since both the
parameter and the output space are highly discretized, such networks have
proven very difficult to optimize. The difficulties are compounded for the
class of transformer text generation models due to the sensitivity of the
attention operation to quantization and the noise-compounding effects of
autoregressive decoding in the high-cardinality output space. We approach the
problem with a mix of statistics-based quantization for the weights and elastic
quantization of the activations and demonstrate the first ternary and binary
transformer models on the downstream tasks of summarization and machine
translation. Our ternary BART base achieves an R1 score of 41 on the
CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while
being 16x more efficient. Our binary model, while less accurate, achieves a
highly non-trivial score of 35.6. For machine translation, we achieve BLEU
scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full
precision mBART model score of 26.8. We also compare our approach in the 8-bit
activation setting, where our ternary and even binary weight models can match
or outperform the best existing 8-bit weight models in the literature. Our code
and models are available at:
https://github.com/facebookresearch/Ternary_Binary_Transformer
Comment: ACL 2023 Oral
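For intuition, here is a minimal sketch of one common statistics-based ternarization rule, in the style of Ternary Weight Networks rather than this paper's exact recipe: weights below a threshold derived from the mean absolute value are zeroed, and the survivors snap to a shared scale. The 0.7 threshold factor and the straight-through estimator are standard assumptions, not details taken from the abstract.

    import torch

    def ternarize(w: torch.Tensor) -> torch.Tensor:
        # Layer-wise threshold from weight statistics (TWN-style heuristic).
        delta = 0.7 * w.abs().mean()
        mask = (w.abs() > delta).float()      # 1 where the weight survives
        # Shared scale: mean magnitude of the surviving weights.
        alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
        return alpha * torch.sign(w) * mask   # values in {-alpha, 0, +alpha}

    def ternarize_ste(w: torch.Tensor) -> torch.Tensor:
        # Straight-through estimator: quantized forward pass, identity
        # backward pass, so gradients update the latent fp32 weights.
        return w + (ternarize(w) - w).detach()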
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task
The study explores the effectiveness of the Chain-of-Thought approach, known
for its proficiency in language tasks by breaking them down into sub-tasks and
intermediate steps, in improving vision-language tasks that demand
sophisticated perception and reasoning. We present the "Description then
Decision" strategy, which is inspired by how humans process signals. This
strategy significantly improves probing task performance by 50%, establishing
the groundwork for future research on reasoning paradigms in complex
vision-language tasks.
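A rough sketch of how such a two-stage strategy can be wired around any vision-language model follows; the vlm callable and the prompt wording are illustrative assumptions, not the paper's exact prompts.

    def description_then_decision(vlm, image, question: str) -> str:
        # Stage 1: elicit a focused textual description of the visual
        # evidence relevant to the question.
        description = vlm(image, f"Describe the parts of the image "
                                 f"relevant to: {question}")
        # Stage 2: condition the final decision on that intermediate text.
        return vlm(image, f"Description: {description}\n"
                          f"Using it, answer: {question}")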
UniK-QA: Unified Representations of Structured and Unstructured Knowledge for Open-Domain Question Answering
We study open-domain question answering with structured, unstructured and
semi-structured knowledge sources, including text, tables, lists and knowledge
bases. Departing from prior work, we propose a unifying approach that
homogenizes all sources by reducing them to text and applies the
retriever-reader model which has so far been limited to text sources only. Our
approach greatly improves results on knowledge-base QA tasks by 11 points
compared to the latest graph-based methods. More importantly, we demonstrate
that our unified knowledge (UniK-QA) model is a simple yet effective way to
combine heterogeneous sources of knowledge, advancing the state-of-the-art
results on two popular question answering benchmarks, NaturalQuestions and
WebQuestions, by 3.5 and 2.6 points, respectively.
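The key move is verbalizing every source into plain text that an ordinary retriever-reader can consume. A minimal sketch of such flattening is below; the exact templates are assumptions, not the ones used by UniK-QA.

    def triple_to_text(subj: str, rel: str, obj: str) -> str:
        # Verbalize a knowledge-base triple as a short sentence.
        return f"{subj} {rel.replace('_', ' ')} {obj}."

    def table_row_to_text(title: str, headers: list, row: list) -> str:
        # Flatten one table row into "header is cell" clauses.
        cells = ", ".join(f"{h} is {c}" for h, c in zip(headers, row))
        return f"{title}: {cells}."

    # Example: table_row_to_text("2010 Census", ["City", "Population"],
    #                            ["Honolulu", "337,256"])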
A Study on the Efficiency and Generalization of Light Hybrid Retrievers
Existing hybrid retrievers, which integrate sparse and dense retrievers, are
indexing-heavy, limiting their applicability in real-world on-device settings.
We ask the question "Is it possible to reduce the indexing memory of hybrid
retrievers without sacrificing performance?" Driven by this question, we
leverage an indexing-efficient dense retriever (i.e. DrBoost) to obtain a light
hybrid retriever. Moreover, to further reduce the memory, we introduce a
lighter dense retriever (LITE), which is jointly trained with contrastive
learning and knowledge distillation from DrBoost. Compared to previous heavy
hybrid retrievers, our Hybrid-LITE retriever uses 13x less memory while
maintaining 98.0% of the performance.
In addition, we study the generalization of light hybrid retrievers along two
dimensions, out-of-domain (OOD) generalization and robustness against
adversarial attacks. We evaluate models on two existing OOD benchmarks and
create six adversarial attack sets for robustness evaluation. Experiments show
that our light hybrid retrievers achieve better robustness performance than
both sparse and dense retrievers. Nevertheless, there is still large room to
improve the robustness of retrievers, and our datasets can aid future research.
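For reference, the usual way a hybrid retriever combines its two components is a simple score interpolation over the union of candidates. The sketch below assumes per-document score dictionaries and a tunable weight, which are generic choices rather than details from the abstract.

    def hybrid_scores(dense: dict, sparse: dict, alpha: float = 0.5) -> dict:
        # Linear interpolation of dense and sparse relevance scores;
        # documents missing from one ranked list contribute zero there.
        docs = set(dense) | set(sparse)
        return {d: alpha * dense.get(d, 0.0)
                   + (1 - alpha) * sparse.get(d, 0.0)
                for d in docs}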
How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval
Various techniques have been developed in recent years to improve dense
retrieval (DR), such as unsupervised contrastive learning and pseudo-query
generation. Existing DRs, however, often suffer from effectiveness tradeoffs
between supervised and zero-shot retrieval, which some attribute to limited
model capacity. We challenge this hypothesis and show that a
generalizable DR can be trained to achieve high accuracy in both supervised and
zero-shot retrieval without increasing model size. In particular, we
systematically examine the contrastive learning of DRs, under the framework of
Data Augmentation (DA). Our study shows that common DA practices, such as query
augmentation with generative models and pseudo-relevance label creation using a
cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA
approach with diverse queries and sources of supervision to progressively train
a generalizable DR. As a result, DRAGON, our dense retriever trained with
diverse augmentation, is the first BERT-base-sized DR to achieve
state-of-the-art effectiveness in both supervised and zero-shot evaluations and
even competes with models using more complex late interaction (ColBERTv2 and
SPLADE++).
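The training signal underneath all of these DA variants is a standard contrastive (InfoNCE) objective over query-passage pairs; what varies is where the queries and relevance labels come from. A minimal single-query sketch, with an assumed temperature value:

    import torch
    import torch.nn.functional as F

    def infonce_loss(q, pos, negs, tau: float = 0.05):
        # q: [d] query embedding, pos: [d] relevant passage, negs: [n, d].
        logits = torch.cat([(q @ pos).unsqueeze(0), negs @ q]) / tau
        # The positive passage sits at index 0 of the candidate list.
        return F.cross_entropy(logits.unsqueeze(0),
                               torch.zeros(1, dtype=torch.long))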
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Several post-training quantization methods have been applied to large
language models (LLMs) and have been shown to perform well down to 8 bits. We
find that these methods break down at lower bit precision, and investigate
quantization aware training for LLMs (LLM-QAT) to push quantization levels even
further. We propose a data-free distillation method that leverages generations
produced by the pre-trained model, which better preserves the original output
distribution and allows quantizing any generative model independent of its
training data, similar to post-training quantization methods. In addition to
quantizing weights and activations, we also quantize the KV cache, which is
critical for increasing throughput and supporting long-sequence dependencies at
current model sizes. We experiment with LLaMA models of sizes 7B, 13B, and 30B,
at quantization levels down to 4 bits. We observe large improvements over
training-free methods, especially in low-bit settings.
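Quantization-aware training typically inserts "fake quantization" into the forward pass while letting gradients flow through unchanged. Below is a minimal per-tensor symmetric sketch; LLM-QAT itself makes more careful per-channel and per-token choices, so treat this as an illustration of the mechanism only.

    import torch

    def fake_quantize(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
        # Symmetric per-tensor quantization: scale set by the max magnitude.
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
        # Straight-through estimator: quantized forward, identity backward.
        return x + (q - x).detach()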
Effective Long-Context Scaling of Foundation Models
We present a series of long-context LLMs that support effective context
windows of up to 32,768 tokens. Our model series is built through continual
pretraining from Llama 2 with longer training sequences and on a dataset where
long texts are upsampled. We perform extensive evaluation on language modeling,
synthetic context probing tasks, and a wide range of research benchmarks. On
research benchmarks, our models achieve consistent improvements on most regular
tasks and significant improvements on long-context tasks over Llama 2. Notably,
with a cost-effective instruction tuning procedure that does not require
human-annotated long instruction data, the 70B variant can already surpass
gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks.
Alongside these results, we provide an in-depth analysis on the individual
components of our method. We delve into Llama's position encodings and discuss
their limitations in modeling long dependencies. We also examine the impact of
various design choices in the pretraining process, including the data mix and
the training curriculum of sequence lengths. Our ablation experiments suggest
that having abundant long texts in the pretraining dataset is not the key to
achieving strong performance, and we empirically verify that long-context
continual pretraining is more efficient than, and similarly effective to,
pretraining from scratch with long sequences.
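One concrete knob this kind of analysis points at is the rotary position encoding: raising RoPE's base frequency slows the per-dimension rotation so that distant tokens remain distinguishable. A sketch of computing the rotation angles with an adjustable base; the specific base value shown is illustrative, not a figure taken from the abstract.

    import torch

    def rope_angles(seq_len: int, head_dim: int, base: float = 500000.0):
        # Per-dimension inverse frequencies; a larger base means slower
        # rotation and therefore longer resolvable dependencies.
        inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
        positions = torch.arange(seq_len).float()
        return torch.outer(positions, inv_freq)  # [seq_len, head_dim // 2]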