4,798 research outputs found
No More Pesky Learning Rates
The performance of stochastic gradient descent (SGD) depends critically on
how learning rates are tuned and decreased over time. We propose a method to
automatically adjust multiple learning rates so as to minimize the expected
error at any one time. The method relies on local gradient variations across
samples. In our approach, learning rates can increase as well as decrease,
making it suitable for non-stationary problems. Using a number of convex and
non-convex learning tasks, we show that the resulting algorithm matches the
performance of SGD or other adaptive approaches with their best settings
obtained through systematic search, and effectively removes the need for
learning rate tuning
Large space structures testing
There is considerable interest in the development of testing concepts and facilities that accurately simulate the pathologies believed to exist in future spacecraft. Both the Government and Industry have participated in the development of facilites over the past several years. The progress and problems associated with the development of the Large Space Structure Test Facility at the Marshall Flight Center are presented. This facility was in existence for a number of years and its utilization has run the gamut from total in-house involvement, third party contractor testing, to the mutual participation of other Government Agencies in joint endeavors
Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients
Recent work has established an empirically successful framework for adapting
learning rates for stochastic gradient descent (SGD). This effectively removes
all needs for tuning, while automatically reducing learning rates over time on
stationary problems, and permitting learning rates to grow appropriately in
non-stationary tasks. Here, we extend the idea in three directions, addressing
proper minibatch parallelization, including reweighted updates for sparse or
orthogonal gradients, improving robustness on non-smooth loss functions, in the
process replacing the diagonal Hessian estimation procedure that may not always
be available by a robust finite-difference approximation. The final algorithm
integrates all these components, has linear complexity and is hyper-parameter
free.Comment: Published at the First International Conference on Learning
Representations (ICLR-2013). Public reviews are available at
http://openreview.net/document/c14f2204-fd66-4d91-bed4-153523694041#c14f2204-fd66-4d91-bed4-15352369404
Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings.Comment: EMNLP 201
ADADELTA: An Adaptive Learning Rate Method
We present a novel per-dimension learning rate method for gradient descent
called ADADELTA. The method dynamically adapts over time using only first order
information and has minimal computational overhead beyond vanilla stochastic
gradient descent. The method requires no manual tuning of a learning rate and
appears robust to noisy gradient information, different model architecture
choices, various data modalities and selection of hyperparameters. We show
promising results compared to other methods on the MNIST digit classification
task using a single machine and on a large scale voice dataset in a distributed
cluster environment.Comment: 6 page
A Robust Adaptive Stochastic Gradient Method for Deep Learning
Stochastic gradient algorithms are the main focus of large-scale optimization
problems and led to important successes in the recent advancement of the deep
learning algorithms. The convergence of SGD depends on the careful choice of
learning rate and the amount of the noise in stochastic estimates of the
gradients. In this paper, we propose an adaptive learning rate algorithm, which
utilizes stochastic curvature information of the loss function for
automatically tuning the learning rates. The information about the element-wise
curvature of the loss function is estimated from the local statistics of the
stochastic first order gradients. We further propose a new variance reduction
technique to speed up the convergence. In our experiments with deep neural
networks, we obtained better performance compared to the popular stochastic
gradient algorithms.Comment: IJCNN 2017 Accepted Paper, An extension of our paper, "ADASECANT:
Robust Adaptive Secant Method for Stochastic Gradient
- …