Better Word Embeddings by Disentangling Contextual n-Gram Information
Pre-trained word vectors are ubiquitous in Natural Language Processing
applications. In this paper, we show how training word embeddings jointly with
bigram and even trigram embeddings results in improved unigram embeddings. We
claim that training word embeddings along with higher n-gram embeddings helps
in the removal of the contextual information from the unigrams, resulting in
better stand-alone word embeddings. We empirically show the validity of our
hypothesis by outperforming other competing word representation models by a
significant margin on a wide variety of tasks. We make our models publicly
available.
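As a rough illustration of the idea described above, here is a toy sketch (illustrative names and corpus, not the authors' released implementation): the context representation is the sum of unigram and bigram vectors, so the bigram vectors can absorb phrase-level context and leave the unigram vectors cleaner.

```python
# Toy sketch: unigram and bigram embeddings trained jointly, CBOW-style,
# with a single negative sample per step. Not the authors' code.
import numpy as np

rng = np.random.default_rng(0)
sentence = "new york is a big city".split()

unigrams = sorted(set(sentence))
bigrams = sorted({f"{a}_{b}" for a, b in zip(sentence, sentence[1:])})
vocab = {w: i for i, w in enumerate(unigrams + bigrams)}

dim = 16
emb_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # context embeddings
emb_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # target embeddings

def context_features(pos, window=2):
    """Unigram and bigram ids around position `pos` (target word excluded)."""
    feats = set()
    hi = min(len(sentence), pos + window + 1)
    for j in range(max(0, pos - window), hi):
        if j != pos:
            feats.add(vocab[sentence[j]])
        if j != pos and j + 1 != pos and j + 1 < hi:
            feats.add(vocab[f"{sentence[j]}_{sentence[j + 1]}"])
    return sorted(feats)

def cbow_step(pos, lr=0.05):
    """One CBOW step: the summed context vector predicts the centre word."""
    feats = context_features(pos)
    h = emb_in[feats].sum(axis=0)
    samples = [(vocab[sentence[pos]], 1.0), (int(rng.integers(len(vocab))), 0.0)]
    grad_h = np.zeros(dim)
    for out_id, label in samples:
        score = 1.0 / (1.0 + np.exp(-h @ emb_out[out_id]))
        grad_h += (score - label) * emb_out[out_id]
        emb_out[out_id] -= lr * (score - label) * h
    emb_in[feats] -= lr * grad_h

for _ in range(5):
    for pos in range(len(sentence)):
        cbow_step(pos)
```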
Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features
The recent tremendous success of unsupervised word embeddings in a multitude
of applications raises the obvious question of whether similar methods could be derived
to improve embeddings (i.e. semantic representations) of word sequences as
well. We present a simple but efficient unsupervised objective to train
distributed representations of sentences. Our method outperforms the
state-of-the-art unsupervised models on most benchmark tasks, highlighting the
robustness of the produced general-purpose sentence embeddings.
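A minimal sketch of the compositional step (hypothetical embedding table and names, not the released code): a sentence vector is the average of the vectors of its unigrams and bigrams, which keeps inference cheap and fully unsupervised.

```python
# Sketch only: compose a sentence embedding by averaging unigram and bigram
# vectors looked up in a pre-trained table (toy table below).
import numpy as np

def sentence_embedding(tokens, table, dim=300):
    """Average the embeddings of all unigrams and bigrams found in `table`."""
    feats = list(tokens) + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    vecs = [table[f] for f in feats if f in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# usage with a toy table
rng = np.random.default_rng(0)
table = {w: rng.normal(size=300) for w in
         ["the", "cat", "sat", "the_cat", "cat_sat"]}
vec = sentence_embedding("the cat sat".split(), table)
```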
DoGE: Domain Reweighting with Generalization Estimation
The coverage and composition of the pretraining data corpus significantly
impact the generalization ability of large language models. Conventionally,
the pretraining corpus is composed of various source domains (e.g. CommonCrawl,
Wikipedia, GitHub) according to certain sampling probabilities (domain
weights). However, current methods lack a principled way to optimize domain
weights for the ultimate goal of generalization. We propose DOmain reweighting
with Generalization Estimation (DoGE), which reweights the sampling
probability from each domain based on its contribution to the final
generalization objective assessed by a gradient-based generalization estimation
function. First, we train a small-scale proxy model with a min-max optimization
to obtain the reweighted domain weights. At each step, the domain weights are
updated to maximize the overall generalization gain via mirror descent. Finally,
we use the obtained domain weights to train a larger, full-size language
model. On the SlimPajama-6B dataset, with a universal generalization objective, DoGE
achieves better average perplexity and zero-shot reasoning accuracy. On
out-of-domain generalization tasks, DoGE reduces perplexity on the target
domain by a large margin. We further apply a parameter-selection scheme which
improves the efficiency of generalization estimation.
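As a rough sketch of the reweighting step (illustrative names, and assuming the per-domain generalization gains have already been estimated; not the DoGE code): mirror descent on the probability simplex reduces to an exponentiated-gradient update followed by renormalization.

```python
# Minimal mirror-descent (exponentiated-gradient) update of domain weights
# on the simplex, given estimated per-domain generalization gains.
import numpy as np

def update_domain_weights(weights, gains, step_size=1.0):
    """Up-weight domains with larger estimated contribution to the
    generalization objective, then renormalize to the simplex."""
    logits = np.log(weights) + step_size * gains
    logits -= logits.max()                   # numerical stability
    new_weights = np.exp(logits)
    return new_weights / new_weights.sum()

# usage: 4 domains, uniform start, toy gain estimates
w = np.full(4, 0.25)
gains = np.array([0.1, 0.4, -0.2, 0.05])    # e.g. gradient-alignment scores
w = update_domain_weights(w, gains, step_size=2.0)
```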
Taming GANs with Lookahead
Generative Adversarial Networks are notoriously challenging to train. The
underlying minimax optimization is highly susceptible to the variance of the
stochastic gradient and the rotational component of the associated game vector
field. We empirically demonstrate the effectiveness of the Lookahead
meta-optimization method for optimizing games, originally proposed for standard
minimization. The backtracking step of Lookahead naturally handles the
rotational game dynamics, which in turn enables the gradient ascent descent
method to converge on challenging toy games often analyzed in the literature.
Moreover, it implicitly handles high variance without using large mini-batches,
known to be essential for reaching state-of-the-art performance. Experimental
results on MNIST, SVHN, and CIFAR-10 demonstrate a clear advantage of
combining Lookahead with Adam or extragradient in terms of performance, memory
footprint, and stability. Using 30-fold fewer parameters and 16-fold smaller
minibatches, we outperform the reported performance of the class-dependent
BigGAN on CIFAR-10, obtaining a lower FID without using the class labels and
bringing state-of-the-art GAN training within reach of common computational
resources.
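For reference, a generic sketch of the Lookahead step (not the paper's exact GAN training loop): the fast weights come from any inner optimizer such as Adam or extragradient, and every k steps the slow weights move a fraction alpha toward them, after which the fast weights are reset to the slow ones. In a GAN this wrapper would be applied to both players.

```python
# Generic Lookahead wrapper: slow weights track the fast weights every k steps.
import numpy as np

class Lookahead:
    def __init__(self, params, alpha=0.5, k=5):
        self.slow = {name: p.copy() for name, p in params.items()}
        self.alpha, self.k, self.step_count = alpha, k, 0

    def step(self, params):
        """Call after each inner-optimizer update of `params` (fast weights)."""
        self.step_count += 1
        if self.step_count % self.k == 0:
            for name, p in params.items():
                self.slow[name] += self.alpha * (p - self.slow[name])
                p[:] = self.slow[name]       # backtrack the fast weights

# usage: wrap the generator's parameters (and likewise the discriminator's)
gen_params = {"w": np.zeros(10)}
lookahead_g = Lookahead(gen_params, alpha=0.5, k=5)
for _ in range(20):
    gen_params["w"] += 0.01 * np.random.randn(10)   # stand-in for an Adam step
    lookahead_g.step(gen_params)
```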
Revisiting the ACVI Method for Constrained Variational Inequalities
ACVI is a recently proposed first-order method for solving variational
inequalities (VIs) with general constraints. Yang et al. (2022) showed that the
gap function of the last iterate decreases at a rate of $\mathcal{O}(1/\sqrt{K})$
when the operator is $L$-Lipschitz, monotone, and at least one constraint is active.
In this work, we show that the same guarantee holds when only assuming that
the operator is monotone.
To our knowledge, this is the first analytically derived last-iterate
convergence rate for general monotone VIs, and overall the only one that does
not rely on the assumption that the operator is $L$-Lipschitz.
Furthermore, when the sub-problems of ACVI are solved approximately, we show
that by using a standard warm-start technique the convergence rate stays the
same, provided that the errors decrease at appropriate rates.
We further provide empirical analyses and insights on its implementation for
the latter case.
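For reference, the statement uses the standard restricted gap function; in common notation (which may differ in details from the paper's), with monotone operator $F$, constraint set $\mathcal{X}$, and last iterate $x_K$:

```latex
% Restricted gap function and the last-iterate rate referred to above.
\[
  \mathcal{G}(x) \;=\; \sup_{y \in \mathcal{X}} \,\langle F(x),\, x - y \rangle,
  \qquad
  \mathcal{G}(x_K) \;=\; \mathcal{O}\!\left(\tfrac{1}{\sqrt{K}}\right).
\]
```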
Agree to Disagree: Diversity through Disagreement for Better Transferability
Gradient-based learning algorithms have an implicit simplicity bias which in
effect can limit the diversity of predictors being sampled by the learning
procedure. This behavior can hinder the transferability of trained models by
(i) favoring the learning of simpler but spurious features -- present in the
training data but absent from the test data -- and (ii) by only leveraging a
small subset of predictive features. Such an effect is especially magnified
when the test distribution does not exactly match the train distribution --
referred to as the Out of Distribution (OOD) generalization problem. However,
given only the training data, it is not always possible to assess a priori whether a
given feature is spurious or transferable. Instead, we advocate for learning an
ensemble of models which capture a diverse set of predictive features. Towards
this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training),
which enforces agreement among the models on the training data, but
disagreement on the OOD data. We show how D-BAT naturally emerges from the
notion of generalized discrepancy, as well as demonstrate in multiple
experiments how the proposed method can mitigate shortcut learning, improve
uncertainty estimation and OOD detection, and improve transferability.
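A hedged sketch of a D-BAT-style objective for a pair of binary classifiers (illustrative form; the paper's exact loss and weighting may differ): both models are fit on the labeled training data, and an additional term rewards them for disagreeing on unlabeled out-of-distribution inputs.

```python
# Sketch of a diversity-by-disagreement loss for two binary classifiers.
import numpy as np

def dbat_loss(p1_train, p2_train, y_train, p1_ood, p2_ood, alpha=1.0, eps=1e-8):
    """p*_train / p*_ood are the two models' predicted probabilities of class 1."""
    def bce(p, y):
        return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    fit = bce(p1_train, y_train) + bce(p2_train, y_train)
    # disagreement term: large penalty when both models agree on OOD inputs
    agree = p1_ood * p2_ood + (1 - p1_ood) * (1 - p2_ood)
    disagreement = -np.mean(np.log(1.0 - agree + eps))
    return fit + alpha * disagreement

# toy usage
y = np.array([0, 1, 1])
loss = dbat_loss(np.array([0.2, 0.8, 0.9]), np.array([0.1, 0.7, 0.95]), y,
                 np.array([0.6, 0.4]), np.array([0.5, 0.5]))
```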
MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
Large language models (LLMs) can potentially democratize access to medical
knowledge. While many efforts have been made to harness and improve LLMs'
medical knowledge and reasoning capacities, the resulting models are either
closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters),
which restricts their abilities. In this work, we improve access to large-scale
medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B
parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through
our adaptation of Nvidia's Megatron-LM distributed trainer), and extends
pretraining on a comprehensively curated medical corpus, including selected
PubMed articles, abstracts, and internationally-recognized medical guidelines.
Evaluations using four major medical benchmarks show significant performance
gains over several state-of-the-art baselines before and after task-specific
finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the
best public baseline in its parameter class and 3% over the strongest baseline
we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B
outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of
Med-PaLM-2. We release our code for curating the medical pretraining corpus and
the MEDITRON model weights to drive open-source development of more capable
medical LLMs.
On critical points of the relative fractional perimeter
We study the localization of sets with constant nonlocal mean curvature
and prescribed small volume in a bounded open set with smooth boundary, proving that
they are sufficiently close to critical points of a suitable non-local potential. We then
consider the fractional perimeter in half-spaces. We prove the existence of a minimizer
under a fixed volume constraint, showing some of its properties such as smoothness and
symmetry, being a graph in the $x_N$-direction, and characterizing its intersection with
the hyperplane $\{x_N = 0\}$.
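For reference, a common definition of the $s$-fractional perimeter (up to a normalizing constant; the relative version used above keeps only the interactions involving the given open set):

```latex
% Standard s-fractional perimeter of a set E, s in (0,1).
\[
  P_s(E) \;=\; \int_{\mathbb{R}^N}\!\int_{\mathbb{R}^N}
  \frac{|\chi_E(x) - \chi_E(y)|}{|x-y|^{N+s}}\, dx\, dy .
\]
```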