217 research outputs found
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in
Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a
homogeneous design where the same number of experts of the same size are placed
uniformly throughout the network. Furthermore, existing MoE works do not
consider computational constraints (e.g., FLOPs, latency) to guide their
design. To this end, we develop AutoMoE -- a framework for designing
heterogeneous MoEs under computational constraints. AutoMoE leverages Neural
Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with
4x inference speedup (CPU) and FLOPs reduction over manually designed
Transformers, with parity in BLEU score over dense Transformer and within 1
BLEU point of MoE SwitchTransformer, on aggregate over benchmark datasets for
NMT. Heterogeneous search space with dense and sparsely activated Transformer
modules (e.g., how many experts? where to place them? what should be their
sizes?) allows for adaptive compute -- where different amounts of computations
are used for different tokens in the input. Adaptivity comes naturally from
routing decisions which send tokens to experts of different sizes. AutoMoE
code, data, and trained models are available at https://aka.ms/AutoMoE.
Comment: ACL 2023 Findings
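The adaptive-compute idea in the abstract above (routing sends tokens to experts of different sizes, so tokens consume different amounts of computation) can be illustrated with a minimal sketch. It assumes top-1 routing and a simple two-matmul FLOPs estimate; the expert sizes, `D_MODEL`, and all function names are hypothetical and are not taken from the AutoMoE code.

```python
import math

# Hypothetical settings for illustration only (not from the AutoMoE release).
D_MODEL = 256
EXPERT_HIDDEN = [512, 1024, 2048]  # heterogeneous expert sizes


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ffn_flops(d_model, d_hidden):
    """Rough FLOPs of one feed-forward expert for one token:
    an up- and a down-projection, 2 FLOPs per multiply-add."""
    return 2 * 2 * d_model * d_hidden


def route_top1(router_logits):
    """Return (expert index, routing probability) of the most likely expert."""
    probs = softmax(router_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def token_flops(router_logits):
    """FLOPs spent on a token: depends on which expert the router picks."""
    expert, _ = route_top1(router_logits)
    return ffn_flops(D_MODEL, EXPERT_HIDDEN[expert])
```

Under these assumptions, a token routed to the smallest expert costs a quarter of the FLOPs of a token routed to the largest one, which is the sense in which compute adapts per token.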
Unveiling the frontiers of deep learning: innovations shaping diverse domains
Deep learning (DL) enables the development of computer models that are
capable of learning, visualizing, optimizing, refining, and predicting data. In
recent years, DL has been applied in a range of fields, including audio-visual
data processing, agriculture, transportation prediction, natural language,
biomedicine, disaster management, bioinformatics, drug design, genomics, face
recognition, and ecology. To explore the current state of deep learning, it is
necessary to investigate the latest developments and applications of deep
learning in these disciplines. However, the literature is lacking in exploring
the applications of deep learning in all potential sectors. This paper thus
extensively investigates the potential applications of deep learning across all
major fields of study as well as the associated benefits and challenges. As
evidenced in the literature, DL exhibits accuracy in prediction and analysis,
which makes it a powerful computational tool, and has the ability to articulate
itself and optimize, making it effective in processing data with no prior
training. Given its independence from training data, deep learning necessitates
massive amounts of data for effective analysis and processing, much like data
volume. To handle the challenge of compiling huge amounts of medical,
scientific, healthcare, and environmental data for use in deep learning, gated
architectures like LSTMs and GRUs can be utilized. For multimodal learning,
shared neurons in the neural network for all activities and specialized neurons
for particular tasks are necessary.
Comment: 64 pages, 3 figures, 3 tables
Determinantal Beam Search
Beam search is a go-to strategy for decoding neural sequence models. The
algorithm can naturally be viewed as a subset optimization problem, albeit one
where the corresponding set function does not reflect interactions between
candidates. Empirically, this leads to sets often exhibiting high overlap,
e.g., strings may differ by only a single word. Yet in use-cases that call for
multiple solutions, a diverse or representative set is often desired. To
address this issue, we propose a reformulation of beam search, which we call
determinantal beam search. Determinantal beam search has a natural relationship
to determinantal point processes (DPPs), models over sets that inherently
encode intra-set interactions. By posing iterations in beam search as a series
of subdeterminant maximization problems, we can turn the algorithm into a
diverse subset selection process. In a case study, we use the string
subsequence kernel to explicitly encourage n-gram coverage in text generated
from a sequence model. We observe that our algorithm offers competitive
performance against other diverse set generation strategies in the context of
language generation, while providing a more general approach to optimizing for
diversity.
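To make the subdeterminant view above concrete, here is a minimal sketch of one selection step. It assumes a greedy approximation and a matrix L[i][j] = q_i * K[i][j] * q_j, with q_i the candidate probability and K a similarity kernel; the greedy strategy, the function names, and the toy kernel are illustrative assumptions, not the paper's exact procedure.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def greedy_subdet(log_probs, kernel, k):
    """Greedily pick k candidates that maximize det(L restricted to the set).

    High-probability candidates raise the determinant; similar candidates
    (large off-diagonal kernel entries) lower it, which favours diversity.
    """
    q = [math.exp(lp) for lp in log_probs]
    n = len(q)
    L = [[q[i] * kernel[i][j] * q[j] for j in range(n)] for i in range(n)]
    chosen = []
    for _ in range(k):
        def gain(i):
            idx = chosen + [i]
            return det([[L[u][v] for v in idx] for u in idx])
        rest = [i for i in range(n) if i not in chosen]
        chosen.append(max(rest, key=gain))
    return chosen
```

With `log_probs = [-1.0, -1.1, -1.5]` and a kernel in which candidates 0 and 1 are near-duplicates (similarity 0.99) while candidate 2 is unrelated, plain top-2 selection would keep 0 and 1, whereas this sketch keeps 0 and 2.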
Sparse Fine-tuning for Inference Acceleration of Large Language Models
We consider the problem of accurate sparse fine-tuning of large language
models (LLMs), that is, fine-tuning pretrained LLMs on specialized tasks, while
inducing sparsity in their weights. On the accuracy side, we observe that
standard loss-based fine-tuning may fail to recover accuracy, especially at
high sparsities. To address this, we perform a detailed study of
distillation-type losses, determining an L2-based distillation approach we term
SquareHead which enables accurate recovery even at higher sparsities, across
all model types. On the practical efficiency side, we show that sparse LLMs can
be executed with speedups by taking advantage of sparsity, for both CPU and GPU
runtimes. While the standard approach is to leverage sparsity for computational
reduction, we observe that in the case of memory-bound LLMs sparsity can also
be leveraged for reducing memory bandwidth. We exhibit end-to-end results
showing speedups due to sparsity, while recovering accuracy, on T5 (language
translation), Whisper (speech translation), and open GPT-type (MPT for text
generation). For MPT text generation, we show for the first time that sparse
fine-tuning can reach 75% sparsity without accuracy drops, provide notable
end-to-end speedups for both CPU and GPU inference, and highlight that sparsity
is also compatible with quantization approaches. Models and software for
reproducing our results are provided in Section 6.
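The abstract above describes SquareHead only as an L2-based distillation loss. As a minimal sketch, one plausible form is a per-layer mean squared error between teacher and student features, normalized by the magnitude of the teacher features; the normalization, the `eps` term, and the function names below are assumptions made for illustration.

```python
def mse(xs, ys):
    """Mean squared error between two equal-length lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def normalized_l2_distill(teacher_layers, student_layers, eps=1e-6):
    """Sum over layers of MSE(teacher, student) / (MSE(teacher, 0) + eps).

    Each argument is a list of per-layer feature vectors (lists of floats).
    The normalization is an assumption: it keeps layers with large
    activations from dominating the total.
    """
    total = 0.0
    for t, s in zip(teacher_layers, student_layers):
        zeros = [0.0] * len(t)
        total += mse(t, s) / (mse(t, zeros) + eps)
    return total
```

The loss is zero when the sparse student reproduces the dense teacher's features exactly, and close to 1 per layer when the student outputs all zeros.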
On the Lipschitz Constant of Deep Networks and Double Descent
Existing bounds on the generalization error of deep networks assume some form
of smooth or bounded dependence on the input variable, falling short of
investigating the mechanisms controlling such factors in practice. In this
work, we present an extensive experimental study of the empirical Lipschitz
constant of deep networks undergoing double descent, and highlight
non-monotonic trends strongly correlating with the test error. Building a
connection between parameter-space and input-space gradients for SGD around a
critical point, we isolate two important factors -- namely loss landscape
curvature and distance of parameters from initialization -- respectively
controlling optimization dynamics around a critical point and bounding model
function complexity, even beyond the training data. Our study presents novel
insights on implicit regularization via overparameterization, and effective
model complexity for networks trained in practice.
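One simple way to estimate an empirical Lipschitz constant of the kind studied above is to take the largest input-gradient norm over a set of sample points. The sketch below assumes a scalar-valued function and central finite differences; the estimator and the names `grad_norm` and `empirical_lipschitz` are illustrative choices, not necessarily those used in the paper.

```python
import math


def grad_norm(f, x, h=1e-5):
    """Euclidean norm of the input gradient of scalar f at x,
    estimated with central finite differences."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return math.sqrt(sum(g * g for g in grad))


def empirical_lipschitz(f, points):
    """Lower bound on the Lipschitz constant of f:
    the maximum input-gradient norm over the given points."""
    return max(grad_norm(f, p) for p in points)
```

For a linear function such as `f(x) = 3*x[0] + 4*x[1]`, the estimate is 5 at every point, matching the norm of the weight vector.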
Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot Translation
Zero-shot translation (ZST), which is generally based on a multilingual
neural machine translation model, aims to translate between unseen language
pairs in training data. The common practice to guide the zero-shot language
mapping during inference is to deliberately insert the source and target
language IDs, e.g., <EN> for English and <DE> for German. Recent studies have
shown that language IDs sometimes fail to navigate the ZST task, making them
suffer from the off-target problem (non-target language words exist in the
generated translation) and, therefore, making it difficult to apply the current
multilingual translation model to a broad range of zero-shot language
scenarios. To understand when and why the navigation capabilities of language
IDs are weakened, we compare two extreme decoder input cases in the ZST
directions: Off-Target (OFF) and On-Target (ON) cases. By contrastively
visualizing the contextual word representations (CWRs) of these cases with
teacher forcing, we show that 1) the CWRs of different languages are
effectively distributed in separate regions when the sentence and ID are
matched (ON setting), and 2) if the sentence and ID are unmatched (OFF
setting), the CWRs of different languages are chaotically distributed. Our
analyses suggest that although they work well in ideal ON settings, language
IDs become fragile and lose their navigation ability when faced with off-target
tokens, which commonly exist during inference but are rare in training
scenarios. In response, we employ unlikelihood tuning on the negative (OFF)
samples to minimize their probability such that the language IDs can
discriminate between the on- and off-target tokens during training. Experiments
spanning 40 ZST directions show that our method reduces the off-target ratio by
48.0% on average, leading to a +9.1 BLEU improvement with only 0.3% extra
tuning cost.
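The unlikelihood objective mentioned above can be sketched as follows. The sketch assumes the common token-level form, which penalizes -log(1 - p) for each token of a negative (off-target) sample, alongside the usual -log(p) for positive samples; how the two terms are weighted and combined in the paper is not stated in the abstract, and the function names are hypothetical.

```python
import math


def likelihood_loss(token_probs):
    """Standard negative log-likelihood over the tokens of a positive sample."""
    return -sum(math.log(p) for p in token_probs)


def unlikelihood_loss(token_probs, eps=1e-8):
    """Unlikelihood over the tokens of a negative (off-target) sample:
    the loss grows as the model assigns them higher probability.
    eps guards against log(0) when a probability equals 1."""
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)
```

Minimizing the second term pushes probability mass away from off-target tokens: a token predicted with probability 0.9 incurs a much larger penalty than one predicted with probability 0.1.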
Deep Double Descent via Smooth Interpolation
The ability of overparameterized deep networks to interpolate noisy data,
while at the same time showing good generalization performance, has been
recently characterized in terms of the double descent curve for the test error.
Common intuition from polynomial regression suggests that overparameterized
networks are able to sharply interpolate noisy data, without considerably
deviating from the ground-truth signal, thus preserving generalization ability.
At present, a precise characterization of the relationship between
interpolation and generalization for deep networks is missing. In this work, we
quantify sharpness of fit of the training data interpolated by neural network
functions, by studying the loss landscape w.r.t. the input variable locally
to each training point, over volumes around cleanly- and noisily-labelled
training samples, as we systematically increase the number of model parameters
and training epochs. Our findings show that loss sharpness in the input space
follows both model- and epoch-wise double descent, with worse peaks observed
around noisy labels. While small interpolating models sharply fit both clean
and noisy data, large interpolating models express a smooth loss landscape,
where noisy targets are predicted over large volumes around training data
points, in contrast to existing intuition.
Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation
Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, they generated the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can lead to significant noise in a variety of cases, including the poor handling of polysemes and multi-word expressions, violation of linguistic agreement and inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), where contextual, many-to-many word translations are generated using a 'base' NMT model. We conduct experiments on 3 different language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably or better than massive models like mBART50 and mRASP2, depending on the size of data provided. We empirically analyse several key factors responsible for these improvements, including context, many-to-many substitutions, code-switching language count, etc., and prove that they all contribute to enhanced pretraining of multilingual NMT models.
- …