DimmWitted: A Study of Main-Memory Statistical Analytics
We perform the first study of the tradeoff space of access methods and
replication to support statistical analytics using first-order methods executed
in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical
analytics systems differ from conventional SQL-analytics in the amount and
types of memory incoherence they can tolerate. Our goal is to understand
tradeoffs in accessing the data in row- or column-order and at what granularity
one should share the model and data for a statistical task. We study this new
tradeoff space, and discover there are tradeoffs between hardware and
statistical efficiency. We argue that our tradeoff study may provide valuable
information for designers of analytics engines: for each system we consider,
our prototype engine can run at least one popular task at least 100x faster. We
conduct our study across five architectures using popular models including
SVMs, logistic regression, Gibbs sampling, and neural networks.
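The row- versus column-order access tradeoff the abstract studies can be illustrated with a toy least-squares problem (an illustration of the access patterns only, not DimmWitted's engine): SGD touches one example's row per step, while coordinate descent touches one feature's column per step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 examples, 5 features
w_true = np.arange(1.0, 6.0)
y = X @ w_true                       # noiseless targets

# Row-order access: each SGD step reads a single (contiguous) row.
w = np.zeros(5)
for epoch in range(200):
    for i in range(X.shape[0]):
        xi = X[i]
        grad = (xi @ w - y[i]) * xi  # least-squares gradient for one example
        w -= 0.01 * grad

# Column-order access: coordinate descent reads a single (strided) column.
v = np.zeros(5)
for epoch in range(200):
    r = y - X @ v
    for j in range(X.shape[1]):
        col = X[:, j]
        v[j] += col @ r / (col @ col)  # exact coordinate update
        r = y - X @ v

print(w, v)
```

Both methods recover the same model; what differs is the memory traffic each step generates, which is the hardware-efficiency side of the tradeoff.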
Aggregations over Generalized Hypertree Decompositions
We study a class of aggregate-join queries with multiple aggregation
operators evaluated over annotated relations. We show that straightforward
extensions of standard multiway join algorithms and generalized hypertree
decompositions (GHDs) provide best-known runtime guarantees. In contrast, prior
work uses bespoke algorithms and data structures and does not match these
guarantees. Our extensions to the standard techniques are a pair of simple
tests that (1) determine if two orderings of aggregation operators are
equivalent and (2) determine if a GHD is compatible with a given ordering.
These tests provide a means to find an optimal GHD that, when provided to
standard join algorithms, will correctly answer a given aggregate-join query.
The second class of our contributions is a pair of complete characterizations
of (1) the set of orderings equivalent to a given ordering and (2) the set of
GHDs compatible with some equivalent ordering. We show by example that previous
approaches are incomplete. The key technical consequence of our
characterizations is a decomposition of a compatible GHD into a set of
(smaller) {\em unconstrained} GHDs, i.e. into a set of GHDs of sub-queries
without aggregations. Since this decomposition is comprised of unconstrained
GHDs, we are able to connect to the wide literature on GHDs for join query
processing, thereby obtaining improved runtime bounds, MapReduce variants, and
an efficient method to find approximately optimal GHDs.
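The interaction between aggregation ordering and query decomposition can be seen in a toy semiring example (my own simplified illustration, not the paper's algorithm): summing over a full join equals joining after pre-aggregating each relation per join key, because the SUM operators over the dangling attributes commute past the join.

```python
from collections import defaultdict

R = [(1, 10), (1, 20), (2, 30)]      # R(a, b)
S = [(1, 2.0), (1, 3.0), (2, 5.0)]   # S(a, c), annotations in the sum-product semiring

# Flat evaluation: SUM over b and c on the full join of R and S on attribute a.
flat = sum(b * c for (a1, b) in R for (a2, c) in S if a1 == a2)

# Pushed-down evaluation: aggregate b within R and c within S per join key first.
rb, sc = defaultdict(float), defaultdict(float)
for a, b in R:
    rb[a] += b
for a, c in S:
    sc[a] += c
pushed = sum(rb[a] * sc[a] for a in rb.keys() & sc.keys())

print(flat, pushed)   # the two plans agree
```

The paper's compatibility tests generalize exactly this question: which aggregation orderings, over which decompositions, preserve the query's answer.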
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Gibbs sampling is a Markov chain Monte Carlo technique commonly used for
estimating marginal distributions. To speed up Gibbs sampling, there has
recently been interest in parallelizing it by executing asynchronously. While
empirical results suggest that many models can be efficiently sampled
asynchronously, traditional Markov chain analysis does not apply to the
asynchronous case, and thus asynchronous Gibbs sampling is poorly understood.
In this paper, we derive a better understanding of the two main challenges of
asynchronous Gibbs: bias and mixing time. We show experimentally that our
theoretical results match practical outcomes.
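For reference, a minimal sequential Gibbs sampler over two binary variables looks like the following (the joint table is mine; the asynchronous variant the abstract analyzes would run these conditional updates in parallel on possibly stale values):

```python
import numpy as np

P = np.array([[0.1, 0.2],
              [0.3, 0.4]])       # joint P(x, y); rows index x, columns index y

rng = np.random.default_rng(0)
x = y = 0
counts = np.zeros(2)
for _ in range(100_000):
    # One sweep: resample x from P(x | y), then y from P(y | x).
    x = int(rng.random() < P[1, y] / (P[0, y] + P[1, y]))
    y = int(rng.random() < P[x, 1] / (P[x, 0] + P[x, 1]))
    counts[x] += 1

est = counts / counts.sum()
print(est)   # the true marginal P(x) is [0.3, 0.7]
```

The bias and mixing-time questions in the abstract arise precisely because, once the two conditional updates run concurrently, the chain no longer has the joint P as its exact stationary distribution.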
Gaussian Quadrature for Kernel Features
Kernel methods have recently attracted resurgent interest, showing
performance competitive with deep neural networks in tasks such as speech
recognition. The random Fourier features map is a technique commonly used to
scale up kernel machines, but employing the randomized feature map means that
O(ε^{-2}) samples are required to achieve an approximation error of at most ε.
We investigate some alternative schemes for constructing feature maps that are
deterministic, rather than random, by approximating the kernel in the
frequency domain using Gaussian quadrature. We show that deterministic feature
maps can be constructed, for any γ > 0, to achieve error ε with
O(e^{e^γ} + ε^{-1/γ}) samples as ε goes to 0. Our method works particularly
well with sparse ANOVA
kernels, which are inspired by the convolutional layer of CNNs. We validate our
methods on datasets in different domains, such as MNIST and TIMIT, showing that
deterministic features are faster to generate and achieve accuracy comparable
to the state-of-the-art kernel methods based on random Fourier features.
Comment: Neural Information Processing Systems (NIPS) 201
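The random Fourier features baseline the abstract compares against can be sketched in a few lines (a standard textbook construction for the Gaussian kernel, not the paper's quadrature scheme): inner products of the random features approximate the kernel value.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(X, D):
    """Random Fourier features for the Gaussian kernel exp(-||x - y||^2 / 2)."""
    d = X.shape[1]
    W = rng.normal(size=(d, D))                # frequencies sampled from N(0, I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

x = np.array([0.0, 0.0])
y = np.array([1.0, 0.0])
Z = rff(np.stack([x, y]), D=20_000)
approx = float(Z[0] @ Z[1])
exact = float(np.exp(-0.5 * np.sum((x - y) ** 2)))
print(approx, exact)
```

The O(ε^{-2}) sample complexity shows up here as the 1/sqrt(D) Monte Carlo error of the feature inner product; the paper's deterministic quadrature maps replace the random draw of W with carefully chosen nodes and weights.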
Asynchronous stochastic convex optimization
We show that asymptotically, completely asynchronous stochastic gradient
procedures achieve optimal (even to constant factors) convergence rates for the
solution of convex optimization problems under nearly the same conditions
required for asymptotic optimality of standard stochastic gradient procedures.
Roughly, the noise inherent to the stochastic approximation scheme dominates
any noise from asynchrony. We also give empirical evidence demonstrating the
strong performance of asynchronous, parallel stochastic optimization schemes,
demonstrating that the robustness inherent to stochastic approximation problems
allows substantially faster parallel and asynchronous solution methods.
Comment: 38 pages, 8 figures
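A fixed-staleness toy model makes the abstract's claim concrete (this is my own simplification; the paper's analysis is asymptotic and covers more general asynchrony): each applied gradient was computed on an iterate several steps old, yet the method still converges to the optimum.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
a = rng.normal(loc=3.0, size=10_000)

# f(w) = E_i[(w - a_i)^2 / 2], minimized at the mean of a (about 3.0).
delay = 8
past = deque([0.0] * delay, maxlen=delay)  # iterates from `delay` steps ago
w = 0.0
for _ in range(20_000):
    i = rng.integers(len(a))
    stale_w = past[0]          # the worker read the model `delay` steps ago...
    g = stale_w - a[i]         # ...and its stochastic gradient is applied now
    past.append(w)
    w -= 0.01 * g

print(w)   # near 3.0
```

The point of the abstract is that the error introduced by the staleness is asymptotically dominated by the stochastic gradient noise that is present anyway.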
A Measure of Dependence Between Discrete and Continuous Variables
Mutual Information (MI) is a useful tool for the recognition of mutual
dependence between data sets. Different methods for the estimation of MI have
been developed when both data sets are discrete or when both data sets are
continuous. The MI estimation between a discrete data set and a continuous data
set has not received so much attention. We present here a method for the
estimation of MI for this last case based on the kernel density approximation.
The calculation may be of interest in diverse contexts. Since MI is closely
related to the Jensen-Shannon divergence, the method developed here is of
particular interest in the problem of sequence segmentation.
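A kernel-density sketch of the discrete/continuous estimator described above might look as follows (a minimal variant of my own, with an arbitrary bandwidth; the paper's exact construction may differ): estimate p(y) and p(y | x) with Gaussian KDEs and average log p(y | x) - log p(y) over the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.integers(0, 2, size=n)              # discrete variable
y = rng.normal(loc=2.0 * x, scale=1.0)      # continuous variable, class-shifted

def kde(samples, query, h=0.3):
    """Gaussian kernel density estimate of `samples`, evaluated at `query`."""
    u = (query[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

p_y = kde(y, y)                             # marginal density at each sample
p_y_given_x = np.empty(n)
for c in (0, 1):
    m = x == c
    p_y_given_x[m] = kde(y[m], y[m])        # class-conditional density

mi_nats = float(np.mean(np.log(p_y_given_x) - np.log(p_y)))
print(mi_nats)   # the true MI here is below ln 2, since x is binary
```

Because x is discrete, no density need be estimated for it; only the continuous variable's marginal and conditional densities require the KDE.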
Understanding and Improving Information Transfer in Multi-Task Learning
We investigate multi-task learning approaches that use a shared feature
representation for all tasks. To better understand the transfer of task
information, we study an architecture with a shared module for all tasks and a
separate output module for each task. We study the theory of this setting on
linear and ReLU-activated models. Our key observation is that whether or not
tasks' data are well-aligned can significantly affect the performance of
multi-task learning. We show that misalignment between task data can cause
negative transfer (or hurt performance) and provide sufficient conditions for
positive transfer. Inspired by the theoretical insights, we show that aligning
tasks' embedding layers leads to performance gains for multi-task training and
transfer learning on the GLUE benchmark and sentiment analysis tasks; for
example, we obtain a 2.35% GLUE score average improvement on 5 GLUE tasks over
BERT-LARGE using our alignment method. We also design an SVD-based task
reweighting scheme and show that it improves the robustness of multi-task
training on a multi-label image dataset.
Comment: Appeared in ICLR 202
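The shared-module architecture the abstract studies can be sketched in its linear form (all names and sizes below are mine, chosen for illustration): a feature module B shared by two regression tasks, separate per-task heads, trained jointly by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 6, 2, 500
B_true = rng.normal(size=(k, d))             # tasks share this 2-dim subspace
X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y1 = X1 @ B_true.T @ np.array([1.0, 0.5])    # well-aligned task data
y2 = X2 @ B_true.T @ np.array([-0.5, 1.0])

B = 0.1 * rng.normal(size=(k, d))            # shared module
a1, a2 = np.zeros(k), np.zeros(k)            # per-task output heads
lr = 0.01
for _ in range(3000):
    r1 = X1 @ B.T @ a1 - y1                  # per-task residuals
    r2 = X2 @ B.T @ a2 - y2
    gB = (np.outer(a1, r1 @ X1) + np.outer(a2, r2 @ X2)) / n
    a1 -= lr * (B @ X1.T @ r1) / n
    a2 -= lr * (B @ X2.T @ r2) / n
    B -= lr * gB                             # shared module sees both tasks

mse1 = float(np.mean((X1 @ B.T @ a1 - y1) ** 2))
mse2 = float(np.mean((X2 @ B.T @ a2 - y2) ** 2))
print(mse1, mse2)
```

Here both tasks live in the same subspace, so the shared module helps; the abstract's negative-transfer result corresponds to tasks whose data pull B toward misaligned subspaces.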
Low-Precision Random Fourier Features for Memory-Constrained Kernel Approximation
We investigate how to train kernel approximation methods that generalize well
under a memory budget. Building on recent theoretical work, we define a measure
of kernel approximation error which we find to be more predictive of the
empirical generalization performance of kernel approximation methods than
conventional metrics. An important consequence of this definition is that a
kernel approximation matrix must be high rank to attain close approximation.
Because storing a high-rank approximation is memory intensive, we propose using
a low-precision quantization of random Fourier features (LP-RFFs) to build a
high-rank approximation under a memory budget. Theoretically, we show
quantization has a negligible effect on generalization performance in important
settings. Empirically, we demonstrate across four benchmark datasets that
LP-RFFs can match the performance of full-precision RFFs and the Nystr\"{o}m
method, with 3x-10x and 50x-460x less memory, respectively.
Comment: International Conference on Artificial Intelligence and Statistics
(AISTATS) 201
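The core LP-RFF idea, keeping many features (high rank) but storing each at low precision, can be sketched as follows (the uniform quantizer below is a minimal stand-in of my own; the paper's scheme and error analysis are more careful):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 8192
W = rng.normal(size=(d, D))                  # Gaussian-kernel frequencies
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def quantize(Z, bits):
    """Uniformly quantize features (range [-lo, lo]) to 2**bits levels."""
    lo = np.sqrt(2.0 / D)                    # scaled-cosine features lie in [-lo, lo]
    levels = 2 ** bits - 1
    q = np.round((Z + lo) / (2.0 * lo) * levels)
    return q / levels * (2.0 * lo) - lo

X = rng.normal(size=(2, d))
exact = float(np.exp(-0.5 * np.sum((X[0] - X[1]) ** 2)))
Z = rff(X)
full = float(Z[0] @ Z[1])                    # full-precision approximation
Zq = quantize(Z, bits=4)
low = float(Zq[0] @ Zq[1])                   # 4-bit approximation
print(exact, full, low)
```

With 4 bits per feature, each feature vector costs one eighth of the float32 storage, so the same memory budget buys roughly 8x more features, which is how a high-rank approximation fits under the budget.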
SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data
We present SwellShark, a framework for building biomedical named entity
recognition (NER) systems quickly and without hand-labeled data. Our approach
views biomedical resources like lexicons as function primitives for
autogenerating weak supervision. We then use a generative model to unify and
denoise this supervision and construct large-scale, probabilistically labeled
datasets for training high-accuracy NER taggers. In three biomedical NER tasks,
SwellShark achieves competitive scores with state-of-the-art supervised
benchmarks using no hand-labeled training data. In a drug name extraction task
using patient medical records, one domain expert using SwellShark achieved
within 5.1% of a crowdsourced annotation approach -- which originally utilized
20 teams over the course of several weeks -- in 24 hours.
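The "lexicons as function primitives" idea can be sketched with a few toy labeling functions and a majority vote (the lexicon, suffix rules, and vote aggregation below are mine; SwellShark's generative model instead learns per-source accuracies to weight and denoise the votes):

```python
# Each labeling function votes +1 (drug), -1 (not a drug), or 0 (abstain).
DRUG_LEXICON = {"aspirin", "ibuprofen", "metformin"}
DRUG_SUFFIXES = ("mab", "cillin")            # e.g. monoclonal antibodies

def lf_lexicon(tok):
    return 1 if tok.lower() in DRUG_LEXICON else 0

def lf_suffix(tok):
    return 1 if tok.lower().endswith(DRUG_SUFFIXES) else 0

def lf_short_acronym(tok):
    return -1 if tok.isupper() and len(tok) <= 3 else 0

def weak_label(tok):
    """Majority vote over labeling-function outputs; 0 means abstain."""
    s = sum(lf(tok) for lf in (lf_lexicon, lf_suffix, lf_short_acronym))
    return 1 if s > 0 else (-1 if s < 0 else 0)

labels = [weak_label(t) for t in ["Aspirin", "rituximab", "DNA", "table"]]
print(labels)   # [1, 1, -1, 0]
```

The probabilistically labeled training set in the abstract comes from replacing this hard vote with the generative model's posterior over the true label.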
Asynchrony begets Momentum, with an Application to Deep Learning
Asynchronous methods are widely used in deep learning, but have limited
theoretical justification when applied to non-convex problems. We show that
running stochastic gradient descent (SGD) in an asynchronous manner can be
viewed as adding a momentum-like term to the SGD iteration. Our result does not
assume convexity of the objective function, so it is applicable to deep
learning systems. We observe that a standard queuing model of asynchrony
results in a form of momentum that is commonly used by deep learning
practitioners. This forges a link between queuing theory and asynchrony in deep
learning systems, which could be useful for systems builders. For convolutional
neural networks, we experimentally validate that the degree of asynchrony
directly correlates with the momentum, confirming our main result. An important
implication is that tuning the momentum parameter is important when considering
different levels of asynchrony. We assert that properly tuned momentum reduces
the number of steps required for convergence. Finally, our theory suggests new
ways of counteracting the adverse effects of asynchrony: a simple mechanism
like using negative algorithmic momentum can improve performance under high
asynchrony. Since asynchronous methods have better hardware efficiency, this
result may shed light on when asynchronous execution is more efficient for deep
learning systems.
Comment: Full version of a paper published in Annual Allerton Conference on
Communication, Control, and Computing (Allerton) 201
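For reference, the momentum iteration that asynchrony implicitly adds has the standard form below; a negative beta is the "negative algorithmic momentum" the abstract proposes as a counter-measure (a toy quadratic of my own, for illustration only):

```python
def momentum_sgd(beta, steps=300, lr=0.1):
    """Heavy-ball iteration on f(w) = w^2 / 2, starting from w = 5."""
    w, v = 5.0, 0.0
    for _ in range(steps):
        g = w                  # gradient of f at the current iterate
        v = beta * v + g       # momentum buffer; asynchrony induces this term
        w -= lr * v
    return w

print(momentum_sgd(0.5), momentum_sgd(-0.2))   # both converge toward 0
```

The abstract's guidance then reads directly off this form: if the execution environment already contributes an implicit positive beta, the explicitly tuned beta should be reduced, possibly below zero.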