SHARCS: Efficient Transformers through Routing with Dynamic Width Sub-networks
We introduce SHARCS, an adaptive inference approach that accounts for the
hardness of input samples. SHARCS trains a router on any transformer
network, enabling the model to direct different samples to sub-networks with
varying widths. Our experiments demonstrate that: (1) SHARCS outperforms or
complements existing per-sample adaptive inference methods across various
classification tasks in terms of accuracy vs. FLOPs; (2) SHARCS generalizes
across different architectures and can even be applied to compressed and
efficient transformer encoders to further improve their efficiency; and (3) SHARCS
provides up to a 2x inference speed-up with a negligible drop in accuracy.
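To make the mechanism concrete, here is a minimal, hedged sketch of width-adaptive routing in the spirit of SHARCS; all module names, dimensions, and the per-width classifier heads are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of sample-adaptive width routing (illustrative only).
import torch
import torch.nn as nn

class WidthRouter(nn.Module):
    """Scores each sample and picks a sub-network width index."""
    def __init__(self, hidden_dim: int, num_widths: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, num_widths)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # Harder samples should land on wider (more expensive) sub-networks.
        return self.score(pooled).argmax(dim=-1)

class AdaptiveEncoder(nn.Module):
    def __init__(self, in_dim=32, hidden_dim=256, widths=(64, 128, 256), num_classes=2):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden_dim)   # stand-in for early layers
        self.router = WidthRouter(hidden_dim, len(widths))
        # One head per width; each reads only a prefix of the hidden state.
        self.heads = nn.ModuleList(nn.Linear(w, num_classes) for w in widths)
        self.widths = widths

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x)
        routes = self.router(h)                      # (batch,) width index per sample
        out = torch.empty(x.size(0), self.heads[0].out_features)
        for i, w in enumerate(self.widths):
            mask = routes == i
            if mask.any():
                out[mask] = self.heads[i](h[mask, :w])  # pay only for width w
        return out

model = AdaptiveEncoder()
logits = model(torch.randn(8, 32))  # each sample costs only its routed width
```

A real system would train the router jointly with the network (for example, with supervision derived from per-sample difficulty); here it is left untrained for brevity.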
EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval
Dense embedding-based retrieval is now the industry standard for semantic
search and ranking problems, like obtaining relevant web documents for a given
query. Such techniques use a two-stage process: (a) contrastive learning to
train a dual encoder to embed both the query and documents and (b) approximate
nearest neighbor search (ANNS) for finding similar documents for a given query.
These two stages are disjoint; the learned embeddings might be ill-suited for
the ANNS method and vice-versa, leading to suboptimal performance. In this
work, we propose End-to-end Hierarchical Indexing -- EHI -- that jointly learns
both the embeddings and the ANNS structure to optimize retrieval performance.
EHI uses a standard dual encoder model for embedding queries and documents
while learning an inverted file index (IVF) style tree structure for efficient
ANNS. To ensure stable and efficient learning of the discrete tree-based ANNS
structure, EHI introduces the notion of a dense path embedding that captures the
position of a query/document in the tree. We demonstrate the effectiveness of
EHI on several benchmarks, including the de facto industry-standard MS MARCO (Dev
set and TREC DL19) datasets. For example, with the same compute budget, EHI
outperforms the state-of-the-art (SOTA) by 0.6% (MRR@10) on the MS MARCO dev set
and by 4.2% (nDCG@10) on the TREC DL19 benchmark.
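The hedged sketch below illustrates the joint-learning idea with a single-level "tree" (a flat set of learned buckets standing in for IVF cells); the encoder shape, loss terms, and one-level simplification are all assumptions, not EHI's actual architecture.

```python
# Illustrative sketch: jointly learn embeddings and an IVF-style partition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointIndexSketch(nn.Module):
    def __init__(self, in_dim=300, dim=128, num_buckets=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Learned centroids play the role of IVF cells.
        self.centroids = nn.Parameter(torch.randn(num_buckets, dim))

    def forward(self, x):
        emb = F.normalize(self.encoder(x), dim=-1)
        # Soft bucket assignment: a dense "path embedding" for a one-level tree.
        path = F.softmax(emb @ self.centroids.T, dim=-1)
        return emb, path

model = JointIndexSketch()
q_emb, q_path = model(torch.randn(4, 300))  # queries
d_emb, d_path = model(torch.randn(4, 300))  # matching documents
# Training combines a contrastive-style term on embeddings with a term
# pulling matched query/document pairs toward the same bucket.
sim_loss = -(q_emb * d_emb).sum(-1).mean()
path_loss = F.kl_div(q_path.log(), d_path, reduction="batchmean")
loss = sim_loss + path_loss
loss.backward()
```

At search time, only the buckets with the highest query-path scores would be probed, giving the usual IVF-style speedup.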
FLUID: A Unified Evaluation Framework for Flexible Sequential Data
Modern ML methods excel when training data is IID, large-scale, and well
labeled. Learning in less ideal conditions remains an open challenge. The
sub-fields of few-shot, continual, transfer, and representation learning have
made substantial strides in learning under adverse conditions, each affording
distinct advantages through its methods and insights. These methods address
different challenges, such as data arriving sequentially or scarce training
examples; however, the difficult conditions an ML system will face over its
lifetime often cannot be anticipated prior to deployment. Therefore, general ML
systems which can handle the many challenges of learning in practical settings
are needed. To foster research towards the goal of general ML methods, we
introduce a new unified evaluation framework - FLUID (Flexible Sequential
Data). FLUID integrates the objectives of few-shot, continual, transfer, and
representation learning while enabling comparison and integration of techniques
across these subfields. In FLUID, a learner faces a stream of data and must
make sequential predictions while choosing how to update itself, adapt quickly
to novel classes, and deal with changing data distributions, all while
accounting for the total amount of compute. We conduct experiments on a broad set of
methods, which sheds new light on the advantages and limitations of current
solutions and points to new research problems. As a starting point
towards more general methods, we present two new baselines which outperform
other evaluated methods on FLUID. Project page:
https://raivn.cs.washington.edu/projects/FLUID/
Comment: 27 pages, 6 figures
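The protocol can be summarized as a predict-then-update loop over a stream; the sketch below is a minimal rendering of that loop with an assumed Learner interface and a trivial baseline, not the benchmark's actual API.

```python
# Minimal sketch of a FLUID-style sequential evaluation loop (assumed API).
from typing import Iterable, Protocol, Tuple

class Learner(Protocol):
    def predict(self, x) -> int: ...
    def update(self, x, y) -> int: ...  # returns compute spent (e.g., FLOPs)

def run_stream(learner: Learner, stream: Iterable[Tuple[object, int]]):
    correct, total, compute = 0, 0, 0
    for x, y in stream:
        pred = learner.predict(x)        # predict before the label is revealed
        correct += int(pred == y)
        total += 1
        compute += learner.update(x, y)  # the learner chooses how much to adapt
    return correct / max(total, 1), compute

class MajorityLearner:
    """Trivial baseline: predict the most frequent class seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else 0
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
        return 0  # negligible compute

acc, flops = run_stream(MajorityLearner(), [(None, 1), (None, 1), (None, 0)])
```

Crucially, accuracy and total compute are reported together, so a method that adapts aggressively must justify its extra cost.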
Neural Priming for Sample-Efficient Adaptation
We propose Neural Priming, a technique for adapting large pretrained models
to distribution shifts and downstream tasks given few or no labeled examples.
Presented with class names or unlabeled test samples, Neural Priming enables
the model to recall relevant data seen throughout pretraining and condition its
parameters on it, thereby priming itself for the test distribution. Neural
Priming can be performed at test time, even for pretraining datasets as large
as LAION-2B. Performing lightweight updates on the recalled data significantly
improves accuracy across a variety of distribution shift and transfer learning
benchmarks. Concretely, in the zero-shot setting, we see a 2.45% improvement in
accuracy on ImageNet and a 3.81% accuracy improvement on average across standard
transfer learning benchmarks. Further, using our test-time inference scheme, we
see a 1.41% accuracy improvement on ImageNetV2. These results demonstrate the
effectiveness of Neural Priming in addressing the common challenge of limited
labeled data and changing distributions. Code is available at
github.com/RAIVNLab/neural-priming
Comment: 18 pages, 8 figures, 9 tables
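At a high level the recipe is retrieve-then-adapt; the following hedged sketch shows one way this could look, with a stand-in encoder, a cached feature matrix in place of a LAION-2B-scale index, and an arbitrary update rule.

```python
# Hedged sketch of a retrieve-then-adapt loop (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def prime(encoder, classifier, pretrain_feats, pretrain_labels,
          test_inputs, k=50, steps=10):
    """Recall pretraining examples nearest to the test distribution,
    then take a few lightweight gradient steps on them."""
    with torch.no_grad():
        q = F.normalize(encoder(test_inputs), dim=-1)
        p = F.normalize(pretrain_feats, dim=-1)
        scores = q @ p.T                                    # similarity to cache
        idx = scores.topk(k, dim=-1).indices.flatten().unique()
    opt = torch.optim.SGD(classifier.parameters(), lr=1e-3)
    for _ in range(steps):                                  # lightweight update
        loss = F.cross_entropy(classifier(pretrain_feats[idx]),
                               pretrain_labels[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return classifier
```

The same loop works in the zero-shot setting by encoding class names instead of test samples to form the retrieval queries.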
Matryoshka Representation Learning
Learned representations are a central component in modern ML systems, serving
a multitude of downstream tasks. When training such representations, it is
often the case that computational and statistical constraints for each
downstream task are unknown. In this context, rigid, fixed-capacity
representations can be either over- or under-accommodating to the task at hand.
This leads us to ask: can we design a flexible representation that can adapt to
multiple downstream tasks with varying computational resources? Our main
contribution is Matryoshka Representation Learning (MRL) which encodes
information at different granularities and allows a single embedding to adapt
to the computational constraints of downstream tasks. MRL minimally modifies
existing representation learning pipelines and imposes no additional cost
during inference and deployment. MRL learns coarse-to-fine representations that
are at least as accurate and rich as independently trained low-dimensional
representations. The flexibility within the learned Matryoshka Representations
offers: (a) up to 14x smaller embedding size for ImageNet-1K classification at
the same level of accuracy; (b) up to 14x real-world speed-ups for large-scale
retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for
long-tail few-shot classification, all while being as robust as the original
representations. Finally, we show that MRL extends seamlessly to web-scale
datasets (ImageNet, JFT) across various modalities -- vision (ViT, ResNet),
vision + language (ALIGN) and language (BERT). MRL code and pretrained models
are open-sourced at https://github.com/RAIVNLab/MRL
Comment: 35 pages, 12 figures. NeurIPS 2022 camera-ready publication
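The core of MRL is a nested objective: the task loss is applied to coarse-to-fine prefixes of a single embedding so that every prefix remains usable on its own. Below is a minimal sketch of that objective; the dimensions, granularities, and separate per-prefix heads are assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a Matryoshka-style nested training objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaHead(nn.Module):
    def __init__(self, dim=256, num_classes=10, granularities=(32, 64, 128, 256)):
        super().__init__()
        self.granularities = granularities
        self.heads = nn.ModuleList(nn.Linear(g, num_classes) for g in granularities)

    def loss(self, emb: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Sum the classification loss over nested prefixes of the embedding.
        return sum(
            F.cross_entropy(head(emb[:, :g]), target)
            for g, head in zip(self.granularities, self.heads)
        )

head = MatryoshkaHead()
emb = torch.randn(16, 256, requires_grad=True)  # stand-in encoder output
loss = head.loss(emb, torch.randint(0, 10, (16,)))
loss.backward()  # gradients reach every nesting level at once
```

At deployment, one simply truncates the embedding to the granularity the downstream task can afford.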
AdANNS: A Framework for Adaptive Semantic Search
Web-scale search systems learn an encoder to embed a given query which is
then hooked into an approximate nearest neighbor search (ANNS) pipeline to
retrieve similar data points. To accurately capture tail queries and data
points, learned representations typically are rigid, high-dimensional vectors
that are generally used as-is in the entire ANNS pipeline and can lead to
computationally expensive retrieval. In this paper, we argue that instead of
rigid representations, different stages of ANNS can leverage adaptive
representations of varying capacities to achieve significantly better
accuracy-compute trade-offs, i.e., stages of ANNS that can get away with more
approximate computation should use a lower-capacity representation of the same
data point. To this end, we introduce AdANNS, a novel ANNS design framework
that explicitly leverages the flexibility of Matryoshka Representations. We
demonstrate state-of-the-art accuracy-compute trade-offs using novel
AdANNS-based key ANNS building blocks like search data structures (AdANNS-IVF)
and quantization (AdANNS-OPQ). For example, on ImageNet retrieval, AdANNS-IVF is
up to 1.5% more accurate than the rigid representations-based IVF at the same
compute budget; and matches accuracy while being up to 90x faster in wall-clock
time. For Natural Questions, 32-byte AdANNS-OPQ matches the accuracy of the
64-byte OPQ baseline constructed using rigid representations -- same accuracy
at half the cost! We further show that the gains from AdANNS translate to
modern-day composite ANNS indices that combine search structures and
quantization. Finally, we demonstrate that AdANNS can enable inference-time
adaptivity for compute-aware search on ANNS indices built non-adaptively on
matryoshka representations. Code is open-sourced at
https://github.com/RAIVNLab/AdANNS
Comment: 25 pages, 15 figures. NeurIPS 2023 camera-ready publication
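To illustrate the adaptivity argument, here is a hedged two-stage sketch: route with a cheap low-dimensional prefix of a matryoshka representation, then re-rank a shortlist with a richer prefix. Dimensions, cluster counts, and the brute-force cell assignment (which a real index would precompute at build time) are arbitrary assumptions.

```python
# Illustrative AdANNS-style adaptive search over matryoshka embeddings.
import numpy as np

def adaptive_search(query, docs, centroids, d_route=32, d_rank=256,
                    nprobe=4, topk=10):
    # Stage 1: choose candidate cells using only the first d_route dims.
    cell_scores = centroids[:, :d_route] @ query[:d_route]
    probed = np.argsort(-cell_scores)[:nprobe]
    # A real index precomputes these assignments when it is built.
    assign = np.argmax(docs[:, :d_route] @ centroids[:, :d_route].T, axis=1)
    candidates = np.where(np.isin(assign, probed))[0]
    # Stage 2: re-rank the shortlist with the richer d_rank-dim prefix.
    scores = docs[candidates, :d_rank] @ query[:d_rank]
    return candidates[np.argsort(-scores)[:topk]]

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 256)).astype(np.float32)
centroids = rng.standard_normal((64, 256)).astype(np.float32)
hits = adaptive_search(rng.standard_normal(256).astype(np.float32),
                       docs, centroids)
```

The cheap stage tolerates approximation, so it gets the low-capacity prefix; the accuracy-critical re-ranking stage gets the high-capacity one, which is exactly the trade-off the abstract describes.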