Gaussian Quadrature for Kernel Features
Kernel methods have recently attracted resurgent interest, showing
performance competitive with deep neural networks in tasks such as speech
recognition. The random Fourier features map is a technique commonly used to
scale up kernel machines, but employing the randomized feature map means that
$O(\epsilon^{-2})$ samples are required to achieve an approximation error of at
most $\epsilon$. We investigate some alternative schemes for constructing
feature maps that are deterministic, rather than random, by approximating the
kernel in the frequency domain using Gaussian quadrature. We show that
deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve
error $\epsilon$ with $O(e^{e^\gamma} + \epsilon^{-1/\gamma})$ samples as
$\epsilon$ goes to 0. Our method works particularly well with sparse ANOVA
kernels, which are inspired by the convolutional layer of CNNs. We validate our
methods on datasets in different domains, such as MNIST and TIMIT, showing that
deterministic features are faster to generate and achieve accuracy comparable
to the state-of-the-art kernel methods based on random Fourier features.
Comment: Neural Information Processing Systems (NIPS) 201
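As a rough illustration of the idea in this abstract, the sketch below builds a deterministic feature map for a one-dimensional Gaussian kernel by replacing the random frequencies of random Fourier features with Gauss-Hermite quadrature nodes. The function name `quadrature_features`, the node count, and the restriction to one dimension are assumptions made for this example; it is not the paper's multi-dimensional construction.

```python
import numpy as np

def quadrature_features(x, num_nodes=16, sigma=1.0):
    """Deterministic 1-D feature map for k(x, y) = exp(-(x - y)^2 / (2 sigma^2)).

    Illustrative sketch only: the kernel equals E[cos(omega * (x - y))] for
    omega ~ N(0, 1 / sigma^2), and that expectation is approximated with a
    Gauss-Hermite rule instead of random samples.
    """
    # Gauss-Hermite rule: integral of exp(-t^2) f(t) dt ~= sum_i w_i f(t_i)
    t, w = np.polynomial.hermite.hermgauss(num_nodes)
    omega = np.sqrt(2.0) * t / sigma   # quadrature frequencies
    a = w / np.sqrt(np.pi)             # positive weights that sum to 1
    x = np.asarray(x, dtype=float)[:, None]
    # cos/sin pairs so that z(x) . z(y) = sum_i a_i cos(omega_i (x - y))
    return np.hstack([np.sqrt(a) * np.cos(omega * x),
                      np.sqrt(a) * np.sin(omega * x)])

x = np.linspace(-1.0, 1.0, 5)
z = quadrature_features(x)
approx = z @ z.T
exact = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)
max_err = np.abs(approx - exact).max()
```

Each of the 16 nodes contributes a cosine and a sine feature, so `z` has 32 columns, and `max_err` measures how far the feature inner products are from the exact kernel values on this small grid.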
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Gibbs sampling is a Markov chain Monte Carlo technique commonly used for
estimating marginal distributions. To speed up Gibbs sampling, there has
recently been interest in parallelizing it by executing asynchronously. While
empirical results suggest that many models can be efficiently sampled
asynchronously, traditional Markov chain analysis does not apply to the
asynchronous case, and thus asynchronous Gibbs sampling is poorly understood.
In this paper, we derive a better understanding of the two main challenges of
asynchronous Gibbs: bias and mixing time. We show experimentally that our
theoretical results match practical outcomes.
Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width
Gibbs sampling on factor graphs is a widely used inference technique, which
often produces good empirical results. Theoretical guarantees for its
performance are weak: even for tree structured graphs, the mixing time of Gibbs
may be exponential in the number of variables. To help understand the behavior
of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy
width. We show that under suitable conditions on the weights, bounded hierarchy
width ensures polynomial mixing time. Our study of hierarchy width is in part
motivated by a class of factor graph templates, hierarchical templates, which
have bounded hierarchy width---regardless of the data used to instantiate them.
We demonstrate a rich application from natural language processing in which
Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds that of
human volunteers.
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of
machine learning problems. Researchers and industry have developed several
techniques to optimize SGD's runtime performance, including asynchronous
execution and reduced precision. Our main result is a martingale-based analysis
that enables us to capture the rich noise models that may arise from such
techniques. Specifically, we use our new analysis in three ways: (1) we derive
convergence rates for the convex case (Hogwild!) with relaxed assumptions on
the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for
non-convex matrix problems including matrix completion; and (3) we design and
analyze an asynchronous SGD algorithm, called Buckwild!, that uses
lower-precision arithmetic. We show experimentally that our algorithms run
efficiently for a variety of problems on modern hardware.
Parallel SGD: When does averaging help?
Consider a number of workers running SGD independently on the same pool of
data and averaging the models every once in a while -- a common but not well
understood practice. We study model averaging as a variance-reducing mechanism
and describe two ways in which the frequency of averaging affects convergence.
For convex objectives, we show the benefit of frequent averaging depends on the
gradient variance envelope. For non-convex objectives, we illustrate that this
benefit depends on the presence of multiple globally optimal points. We
complement our findings with multicore experiments on both synthetic and real
data.
Data Programming: Creating Large Training Sets, Quickly
Large labeled training sets are the critical building blocks of supervised
learning methods and are key enablers of deep learning techniques. For some
applications, creating labeled training sets is the most time-consuming and
expensive part of applying machine learning. We therefore propose a paradigm
for the programmatic creation of training sets called data programming in which
users express weak supervision strategies or domain heuristics as labeling
functions, which are programs that label subsets of the data, but that are
noisy and may conflict. We show that by explicitly representing this training
set labeling process as a generative model, we can "denoise" the generated
training set, and establish theoretically that we can recover the parameters of
these generative models in a handful of settings. We then show how to modify a
discriminative loss function to make it noise-aware, and demonstrate our method
over a range of discriminative models including logistic regression and LSTMs.
Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data
programming would have led to a new winning score, and also show that applying
data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points
over a state-of-the-art LSTM baseline (and into second place in the
competition). Additionally, in initial user studies we observed that data
programming may be an easier way for non-experts to create machine learning
models when training data is limited or unavailable.
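To make the notion of labeling functions concrete, here is a minimal sketch in which a few heuristics vote on whether a sentence expresses a spouse relation. The task, the function names, and the keyword rules are all hypothetical, and the unweighted majority vote is only a simple stand-in for the generative model described in the abstract, which would instead estimate how reliable each function is.

```python
ABSTAIN = 0  # a labeling function returns +1, -1, or ABSTAIN

def lf_married(text):
    # Hypothetical heuristic: the word "married" suggests a spouse relation.
    return 1 if "married" in text else ABSTAIN

def lf_sibling(text):
    # Hypothetical heuristic: sibling words suggest a non-spouse relation.
    return -1 if ("brother" in text or "sister" in text) else ABSTAIN

def lf_his_wife(text):
    # Hypothetical heuristic: the phrase "his wife" suggests a spouse relation.
    return 1 if "his wife" in text else ABSTAIN

LABELING_FUNCTIONS = [lf_married, lf_sibling, lf_his_wife]

def label_matrix(texts):
    """One row per example, one column per labeling function."""
    return [[lf(t) for lf in LABELING_FUNCTIONS] for t in texts]

def majority_vote(row):
    """Sign of the summed votes; 0 when the functions abstain or tie."""
    total = sum(row)
    return (total > 0) - (total < 0)

texts = [
    "Alice married Bob in 2010",
    "Carol and her brother Dan",
    "Eve met Frank",
    "Gina married her sister's friend",
]
rows = label_matrix(texts)
labels = [majority_vote(r) for r in rows]
```

The last sentence triggers two functions that disagree, which is the kind of conflict that a learned model of labeling-function accuracy is meant to resolve better than a plain vote.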
Accelerated Stochastic Power Iteration
Principal component analysis (PCA) is one of the most powerful tools in
machine learning. The simplest method for PCA, the power iteration, requires
$\mathcal O(1/\Delta)$ full-data passes to recover the principal component of a
matrix with eigen-gap $\Delta$. Lanczos, a significantly more complex method,
achieves an accelerated rate of $\mathcal O(1/\sqrt{\Delta})$ passes. Modern
applications, however, motivate methods that only ingest a subset of available
data, known as the stochastic setting. In the online stochastic setting, simple
algorithms like Oja's iteration achieve the optimal sample complexity
$\mathcal O(\sigma^2/\Delta^2)$. Unfortunately, they are fully sequential, and also
require $\mathcal O(\sigma^2/\Delta^2)$ iterations, far from the
$\mathcal O(1/\sqrt{\Delta})$ rate of Lanczos. We propose a simple variant of the power
iteration with an added momentum term, that achieves both the optimal sample
and iteration complexity. In the full-pass setting, standard analysis shows
that momentum achieves the accelerated rate, $\mathcal O(1/\sqrt{\Delta})$. We
demonstrate empirically that naively applying momentum to a stochastic method
does not result in acceleration. We perform a novel, tight variance analysis
that reveals the "breaking-point variance" beyond which this acceleration does
not occur. By combining this insight with modern variance reduction techniques,
we construct stochastic PCA algorithms, for the online and offline setting,
that achieve an accelerated iteration complexity $\mathcal O(1/\sqrt{\Delta})$.
Due to the embarrassingly parallel nature of our methods, this acceleration
translates directly to wall-clock time if deployed in a parallel environment.
Our approach is very general, and applies to many non-convex optimization
problems that can now be accelerated using the same technique.
Comment: 37 pages, 5 figures
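The abstract does not spell out the update rule, so the following is only a sketch of one natural reading of "power iteration with an added momentum term": the deterministic, full-pass recurrence w_next = A w - beta * w_prev. The function name, the test matrix, and the choice beta = lambda_2^2 / 4 (which uses the second eigenvalue, known here only because the example constructs it) are assumptions; the stochastic, variance-reduced variants mentioned above are not shown.

```python
import numpy as np

def power_iteration_momentum(A, beta, num_iters=200, seed=0):
    """Full-pass power iteration with a momentum term (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w_prev = np.zeros(A.shape[0])
    w = rng.standard_normal(A.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(num_iters):
        w_next = A @ w - beta * w_prev
        scale = np.linalg.norm(w_next)
        # Rescale both iterates by the same factor: the recurrence is linear,
        # so this keeps the direction unchanged while avoiding under/overflow.
        w_prev, w = w / scale, w_next / scale
    return w

# Symmetric test matrix with known eigenvalues 1.0, 0.9, 0.5.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q @ np.diag([1.0, 0.9, 0.5]) @ Q.T

w = power_iteration_momentum(A, beta=0.9 ** 2 / 4)
alignment = abs(w @ Q[:, 0])  # cosine with the true principal component
```

With `beta=0` the loop reduces to the plain power iteration, which makes it easy to compare the two on matrices with a small eigen-gap.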
Incremental Knowledge Base Construction Using DeepDive
Populating a database with unstructured information is a long-standing
problem in industry and research that encompasses problems of extraction,
cleaning, and integration. Recent names used for this problem include dealing
with dark data and knowledge base construction (KBC). In this work, we describe
DeepDive, a system that combines database and machine learning ideas to help
develop KBC systems, and we present techniques to make the KBC process more
efficient. We observe that the KBC process is iterative, and we develop
techniques to incrementally produce inference results for KBC systems. We
propose two methods for incremental inference, based respectively on sampling
and variational techniques. We also study the tradeoff space of these methods
and develop a simple rule-based optimizer. DeepDive includes all of these
contributions, and we evaluate DeepDive on five KBC systems, showing that it
can speed up KBC inference tasks by up to two orders of magnitude with
negligible impact on quality.
A Kernel Theory of Modern Data Augmentation
Data augmentation, a technique in which a training set is expanded with
class-preserving transformations, is ubiquitous in modern machine learning
pipelines. In this paper, we seek to establish a theoretical framework for
understanding data augmentation. We approach this from two directions: First,
we provide a general model of augmentation as a Markov process, and show that
kernels appear naturally with respect to this model, even when we do not employ
kernel classification. Next, we analyze more directly the effect of
augmentation on kernel classifiers, showing that data augmentation can be
approximated by first-order feature averaging and second-order variance
regularization components. These frameworks both serve to illustrate the ways
in which data augmentation affects the downstream learning model, and the
resulting analyses provide novel connections between prior work in invariant
kernels, tangent propagation, and robust optimization. Finally, we provide
several proof-of-concept applications showing that our theory can be useful for
accelerating machine learning workflows, such as reducing the amount of
computation needed to train using augmented data, and predicting the utility of
a transformation prior to training.
Improving Neural Network Quantization without Retraining using Outlier Channel Splitting
Quantization can improve the execution latency and energy efficiency of
neural networks on both commodity GPUs and specialized accelerators. The
majority of existing literature focuses on training quantized DNNs, while this
work examines the less-studied topic of quantizing a floating-point model
without (re)training. DNN weights and activations follow a bell-shaped
distribution post-training, while practical hardware uses a linear quantization
grid. This leads to challenges in dealing with outliers in the distribution.
Prior work has addressed this by clipping the outliers or using specialized
hardware. In this work, we propose outlier channel splitting (OCS), which
duplicates channels containing outliers, then halves the channel values. The
network remains functionally identical, but affected outliers are moved toward
the center of the distribution. OCS requires no additional training and works
on commodity hardware. Experimental evaluation on ImageNet classification and
language modeling shows that OCS can outperform state-of-the-art clipping
techniques with only minor overhead.
Comment: 10 pages; update to ICML camera-ready version
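The splitting step can be illustrated on a single linear layer y = W x. In this sketch, assumed for illustration rather than taken from the paper's code, the input channel holding the largest-magnitude weight is duplicated, both copies of its weights are halved, and the matching activation is duplicated so the output is unchanged. The function name `split_outlier_channel` and the toy values are hypothetical.

```python
import numpy as np

def split_outlier_channel(W, x):
    """Duplicate the input channel with the largest |weight| and halve it.

    Illustrative sketch for a linear layer y = W @ x with W of shape
    (out_features, in_features).
    """
    # Column index of the largest-magnitude weight (the "outlier" channel).
    c = np.unravel_index(np.abs(W).argmax(), W.shape)[1]
    half = W[:, c:c + 1] / 2.0
    W_split = np.hstack([W, half])   # append one copy of the halved channel
    W_split[:, c] = half[:, 0]       # halve the original channel in place
    x_split = np.append(x, x[c])     # duplicate the matching activation
    return W_split, x_split

W = np.array([[0.1, -0.2, 4.0],
              [0.3, 0.2, -0.1]])
x = np.array([1.0, 2.0, 3.0])

W_split, x_split = split_outlier_channel(W, x)
y = W @ x
y_split = W_split @ x_split
max_weight_before = np.abs(W).max()
max_weight_after = np.abs(W_split).max()
```

The outputs `y` and `y_split` agree, while the largest weight magnitude drops from 4.0 to 2.0, so a linear quantization grid has a narrower range to cover at the cost of one extra channel.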