28 research outputs found
Improving Generalization Performance by Switching from Adam to SGD
Despite superior training outcomes, adaptive optimization methods such as
Adam, Adagrad or RMSprop have been found to generalize poorly compared to
stochastic gradient descent (SGD). These methods tend to perform well in the
initial portion of training but are outperformed by SGD at later stages of
training. We investigate a hybrid strategy that begins training with an
adaptive method and switches to SGD when appropriate. Concretely, we propose
SWATS, a simple strategy which switches from Adam to SGD when a triggering
condition is satisfied. The condition we propose relates to the projection of
Adam steps on the gradient subspace. By design, the monitoring process for this
condition adds very little overhead and does not increase the number of
hyperparameters in the optimizer. We report experiments on several standard
benchmarks: ResNet, SENet, DenseNet, and PyramidNet on the CIFAR-10 and
CIFAR-100 data sets; ResNet on the tiny-ImageNet data set; and language
modeling with recurrent networks on the PTB and WT2 data sets. The results show
that our strategy is capable of closing the generalization gap between SGD and
Adam on a majority of the tasks.
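As an illustration of the mechanism described above, here is a minimal PyTorch sketch of an Adam-to-SGD switch. The trigger follows the spirit of the projection-based condition (estimating an SGD learning rate from the projection of the Adam step onto the gradient and switching once that estimate stabilizes), but the exact bookkeeping, tolerances, and helper names are illustrative assumptions rather than the paper's implementation.

```python
import torch

def projected_lr(adam_step, grad, eps=1e-12):
    """Estimate an SGD learning rate from the projection of the Adam step
    onto the gradient direction (the quantity the trigger monitors)."""
    denom = -torch.dot(adam_step, grad)
    return torch.dot(adam_step, adam_step) / (denom + eps)

def train_with_switch(model, loader, loss_fn, epochs, beta=0.9, tol=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    switched, lr_avg, step = False, 0.0, 0
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            if not switched:
                # Snapshot parameters and gradient to recover the Adam step.
                before = torch.cat([p.detach().flatten().clone()
                                    for p in model.parameters()])
                grad = torch.cat([p.grad.flatten()
                                  for p in model.parameters()])
            opt.step()
            step += 1
            if not switched:
                after = torch.cat([p.detach().flatten()
                                   for p in model.parameters()])
                gamma = projected_lr(after - before, grad).item()
                lr_avg = beta * lr_avg + (1 - beta) * gamma
                corrected = lr_avg / (1 - beta ** step)  # bias-corrected average
                if gamma > 0 and abs(corrected - gamma) < tol:
                    # Trigger satisfied: hand the parameters over to SGD.
                    opt = torch.optim.SGD(model.parameters(), lr=corrected)
                    switched = True
    return model
```

The tolerance and averaging constant here are sketch-level knobs; the paper's stated design goal is that the monitoring adds negligible overhead and no new hyperparameters.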
Weighted Transformer Network for Machine Translation
State-of-the-art results on neural machine translation often use attentional
sequence-to-sequence models with some form of recurrence or convolution. Vaswani
et al. (2017) propose a new architecture that avoids recurrence and convolution
completely. Instead, it uses only self-attention and feed-forward layers. While
the proposed architecture achieves state-of-the-art results on several machine
translation tasks, it requires a large number of parameters and training
iterations to converge. We propose Weighted Transformer, a Transformer with
modified attention layers, that not only outperforms the baseline network in
BLEU score but also converges 15-40% faster. Specifically, we replace the
multi-head attention with multiple self-attention branches that the model learns
to combine during the training process. Our model improves the state-of-the-art
performance by 0.5 BLEU points on the WMT 2014 English-to-German translation
task and by 0.4 on the English-to-French translation task.
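The branch-combination idea lends itself to a compact sketch. The module below replaces concatenation-style multi-head attention with parallel single-head branches mixed by trainable, softmax-normalized weights; the module name, branch construction, and mixing scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedAttentionBranches(nn.Module):
    """Parallel self-attention branches combined by learned weights."""
    def __init__(self, d_model, n_branches):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            for _ in range(n_branches))
        # One trainable mixing weight per branch; the softmax keeps the
        # combination a convex mixture throughout training.
        self.kappa = nn.Parameter(torch.zeros(n_branches))

    def forward(self, x):
        weights = F.softmax(self.kappa, dim=0)
        outs = [attn(x, x, x)[0] for attn in self.branches]
        return sum(w * o for w, o in zip(weights, outs))

# Usage: drop in where a Transformer block's self-attention would sit.
layer = WeightedAttentionBranches(d_model=512, n_branches=8)
y = layer(torch.randn(2, 10, 512))  # (batch, sequence, d_model)
```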
Regularizing and Optimizing LSTM Language Models
Recurrent neural networks (RNNs), such as long short-term memory networks
(LSTMs), serve as a fundamental building block for many sequence learning
tasks, including machine translation, language modeling, and question
answering. In this paper, we consider the specific problem of word-level
language modeling and investigate strategies for regularizing and optimizing
LSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on
hidden-to-hidden weights as a form of recurrent regularization. Further, we
introduce NT-ASGD, a variant of the averaged stochastic gradient method,
wherein the averaging trigger is determined using a non-monotonic condition as
opposed to being tuned by the user. Using these and other regularization
strategies, we achieve state-of-the-art word-level perplexities on two data
sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the
effectiveness of a neural cache in conjunction with our proposed model, we
achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and
52.0 on WikiText-2.
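The NT-ASGD trigger admits a short sketch. The loop below runs plain SGD and switches to averaged SGD once the validation loss fails to improve on its best value from more than n checks ago (the non-monotonic condition); the window size, learning rate, and callback structure are illustrative assumptions.

```python
import torch

def train_with_nt_asgd(params, train_epoch, validate, max_epochs=500,
                       n=5, lr=30.0):
    """train_epoch(opt) runs one epoch; validate() returns validation loss."""
    params = list(params)  # materialize so two optimizers can share them
    opt = torch.optim.SGD(params, lr=lr)
    history, triggered = [], False
    for _ in range(max_epochs):
        train_epoch(opt)
        val_loss = validate()
        if not triggered and len(history) > n and val_loss > min(history[:-n]):
            # Non-monotonic trigger: loss failed to beat the best value from
            # more than n checks ago, so begin averaging (ASGD with t0=0).
            opt = torch.optim.ASGD(params, lr=lr, t0=0)
            triggered = True
        history.append(val_loss)
    return opt
```

Compared with a hand-tuned averaging start, the switch point here is determined by the loss history itself.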
An Analysis of Neural Language Modeling at Multiple Scales
Many of the leading approaches in language modeling introduce novel, complex
and specialized architectures. We take existing state-of-the-art word level
language models based on LSTMs and QRNNs and extend them to both larger
vocabularies as well as character-level granularity. When properly tuned, LSTMs
and QRNNs achieve state-of-the-art results on character-level (Penn Treebank,
enwik8) and word-level (WikiText-103) datasets, respectively. Results are
obtained in only 12 hours (WikiText-103) to 2 days (enwik8) using a single
modern GPU.
Identifying Generalization Properties in Neural Networks
While it has not yet been proven, empirical evidence suggests that model
generalization is related to local properties of the optima which can be
described via the Hessian. We connect model generalization with the local
property of a solution under the PAC-Bayes paradigm. In particular, we prove
that model generalization ability is related to the Hessian, the higher-order
"smoothness" terms characterized by the Lipschitz constant of the Hessian, and
the scales of the parameters. Guided by the proof, we propose a metric to score
the generalization capability of the model, as well as an algorithm that
optimizes the perturbed model accordingly.
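The quantities named above fit a standard local expansion; the bound below is the generic third-order form for a loss whose Hessian is Lipschitz with constant ρ, given here as a sketch of the structure rather than the paper's exact result.

```latex
% Second-order expansion of the loss around a solution w, with the
% higher-order "smoothness" term controlled by the Hessian's Lipschitz
% constant \rho (notation here is generic, not the paper's).
\[
L(w + \Delta w) \;\le\; L(w) + \nabla L(w)^{\top} \Delta w
  + \tfrac{1}{2}\, \Delta w^{\top} \nabla^{2} L(w)\, \Delta w
  + \tfrac{\rho}{6}\, \lVert \Delta w \rVert^{3},
\quad
\lVert \nabla^{2} L(u) - \nabla^{2} L(v) \rVert \le \rho\, \lVert u - v \rVert .
\]
```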
The Natural Language Decathlon: Multitask Learning as Question Answering
Deep learning has improved performance on many natural language processing
(NLP) tasks individually. However, general NLP models cannot emerge within a
paradigm that focuses on the particularities of a single metric, dataset, and
task. We introduce the Natural Language Decathlon (decaNLP), a challenge that
spans ten tasks: question answering, machine translation, summarization,
natural language inference, sentiment analysis, semantic role labeling,
zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and
commonsense pronoun resolution. We cast all tasks as question answering over a
context. Furthermore, we present a new Multitask Question Answering Network
(MQAN) that jointly learns all tasks in decaNLP without any task-specific modules or
parameters in the multitask setting. MQAN shows improvements in transfer
learning for machine translation and named entity recognition, domain
adaptation for sentiment analysis and natural language inference, and zero-shot
capabilities for text classification. We demonstrate that the MQAN's
multi-pointer-generator decoder is key to this success and performance further
improves with an anti-curriculum training strategy. Though designed for
decaNLP, MQAN also achieves state-of-the-art results on the WikiSQL semantic
parsing task in the single-task setting. We also release code for procuring and
processing data, training and evaluating models, and reproducing all
experiments for decaNLP.
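The "everything as question answering" framing is easy to make concrete: every task reduces to a (question, context, answer) triple, with the task identity carried by the question. The triples below are illustrative stand-ins, not examples drawn from the decaNLP datasets.

```python
from typing import NamedTuple

class QAExample(NamedTuple):
    question: str
    context: str
    answer: str

examples = [
    # Sentiment analysis: the label is the answer.
    QAExample("Is this review positive or negative?",
              "The plot was thin but the acting carried it.", "positive"),
    # Machine translation: the target sentence is the answer.
    QAExample("What is the translation from English to German?",
              "The house is small.", "Das Haus ist klein."),
    # Summarization: the summary is the answer.
    QAExample("What is the summary?",
              "<full article text>", "<one-sentence summary>"),
]

# A single model trains on the union of such triples with no task-specific
# heads; the question alone tells it which task to perform.
batch = [f"question: {ex.question} context: {ex.context}" for ex in examples]
```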
Unifying Question Answering, Text Classification, and Regression via Span Extraction
Even as pre-trained language encoders such as BERT are shared across many
tasks, the output layers of question answering, text classification, and
regression models are significantly different. Span decoders are frequently
used for question answering, fixed-class classification layers for text
classification, and similarity-scoring layers for regression tasks. We show
that this distinction is not necessary and that all three can be unified as
span extraction. A unified span-extraction approach leads to superior or
comparable performance in supplementary supervised pre-training, low-data, and
multi-task learning experiments on several question answering, text
classification, and regression benchmarks.
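The unification itself can be sketched in a few lines: a classification example is rewritten so the gold label appears verbatim in the input, and the target becomes the character span of that label, which a span-extraction head can select. The formatting convention below is an illustrative assumption, not the paper's.

```python
def as_span_extraction(text: str, labels: list[str], gold: str):
    """Append the label options to the input; the supervision signal is the
    span of the gold label inside the appended options."""
    source = f"{text} options: {' '.join(labels)}"
    start = source.rindex(gold)  # locate the gold label in the options
    return source, (start, start + len(gold))

source, (start, end) = as_span_extraction(
    "The movie was a delight.", ["positive", "negative"], "positive")
assert source[start:end] == "positive"
```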
Using Mode Connectivity for Loss Landscape Analysis
Mode connectivity is a recently introduced framework that empirically
establishes the connectedness of minima by finding a high-accuracy curve
between two independently trained models. To investigate the limits of this
setup, we examine the efficacy of this technique in extreme cases where the
input models are trained or initialized differently. We find that the procedure
is resilient to such changes. Given this finding, we propose using the
framework for analyzing loss surfaces and training trajectories more generally,
and in this direction, study SGD with cosine annealing and restarts (SGDR). We
report that while SGDR moves over barriers in its trajectory, propositions
claiming that it converges to and escapes from multiple local minima are not
substantiated by our empirical results.
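Mode connectivity's core construction is compact enough to sketch: parameterize a curve between two trained weight vectors with one trainable bend point and evaluate the loss at points along it. The quadratic Bezier form below follows the common formulation in the mode-connectivity literature; curve training (optimizing the bend over sampled t) is elided, and all names are illustrative.

```python
import torch

def bezier_point(w1, theta, w2, t):
    """Point on the quadratic Bezier curve joining w1 and w2 at t in [0, 1]."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

w1, w2 = torch.randn(1000), torch.randn(1000)  # two trained solutions, flattened
theta = ((w1 + w2) / 2).requires_grad_()       # trainable bend, midpoint init

# To trace the curve, load bezier_point(w1, theta, w2, t) back into the model
# for sampled t and measure loss/accuracy; a flat, low-loss profile is the
# evidence that the two minima are connected.
ts = torch.linspace(0, 1, 11)
points = [bezier_point(w1, theta, w2, t) for t in ts]
```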
Coarse-grain Fine-grain Coattention Network for Multi-evidence Question Answering
End-to-end neural models have made significant progress in question
answering; however, recent studies show that these models implicitly assume that
the answer and evidence appear close together in a single document. In this
work, we propose the Coarse-grain Fine-grain Coattention Network (CFC), a new
question answering model that combines information from evidence across
multiple documents. The CFC consists of a coarse-grain module that interprets
documents with respect to the query then finds a relevant answer, and a
fine-grain module which scores each candidate answer by comparing its
occurrences across all of the documents with the query. We design these modules
using hierarchies of coattention and self-attention, which learn to emphasize
different parts of the input. On the Qangaroo WikiHop multi-evidence question
answering task, the CFC obtains a new state-of-the-art result of 70.6% on the
blind test set, outperforming the previous best by 3% accuracy despite not
using pretrained contextual encoders.
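A single coattention block of the kind the CFC stacks into hierarchies can be sketched directly: an affinity matrix between document and query encodings is normalized in both directions to produce mutually conditioned summaries. Shapes and the second-level combination below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def coattention(doc, query):
    """doc: (n_d, h), query: (n_q, h) -> coattended doc features (n_d, 2h)."""
    affinity = doc @ query.T          # (n_d, n_q) word-pair scores
    a_d = F.softmax(affinity, dim=1)  # each doc word attends over the query
    a_q = F.softmax(affinity, dim=0)  # each query word attends over the doc
    s_d = a_d @ query                 # query summaries for doc words
    s_q = a_q.T @ doc                 # doc summaries for query words
    c_d = a_d @ s_q                   # second-level coattention context
    return torch.cat([s_d, c_d], dim=1)

features = coattention(torch.randn(80, 128), torch.randn(12, 128))  # (80, 256)
```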
A Second-Order Method for Convex ℓ1-Regularized Optimization with Active-Set Prediction
We describe an active-set method for the minimization of an objective
function that is the sum of a smooth convex function and an
ℓ1-regularization term. A distinctive feature of the method is the way in
which active-set identification and second-order subspace minimization steps
are integrated to combine the predictive power of the two approaches. At every
iteration, the algorithm selects a candidate set of free and fixed variables,
performs an (inexact) subspace phase, and then assesses the quality of the new
active set. If it is not judged to be acceptable, then the set of free
variables is restricted and a new active-set prediction is made. We establish
global convergence for our approach, and compare the new method against the
state-of-the-art code LIBLINEAR.
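For reference, the problem class reads as follows in generic notation (f smooth and convex, λ > 0 the regularization weight; the paper's symbols may differ):

```latex
\[
\min_{x \in \mathbb{R}^{n}} \; F(x) \;=\; f(x) + \lambda \lVert x \rVert_{1}
\]
% At each iteration, the subspace phase (approximately) minimizes F over the
% current set of free variables while the fixed variables are held at zero;
% the resulting point is then used to reassess the active-set prediction.
```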