On Universal Features for High-Dimensional Learning and Inference
We consider the problem of identifying universal low-dimensional features
from high-dimensional data for inference tasks in settings involving learning.
For such problems, we introduce natural notions of universality and we show a
local equivalence among them. Our analysis is naturally expressed via
information geometry, and is conceptually and computationally useful. The
development reveals the complementary roles of the singular value
decomposition, Hirschfeld-Gebelein-Rényi maximal correlation, the canonical
correlation and principal component analyses of Hotelling and Pearson, Tishby's
information bottleneck, Wyner's common information, Ky Fan k-norms, and
Breiman and Friedman's alternating conditional expectations algorithm. We
further illustrate how this framework facilitates understanding and optimizing
aspects of learning systems, including multinomial logistic (softmax)
regression and the associated neural network architecture, matrix factorization
methods for collaborative filtering and other applications, rank-constrained
multivariate linear regression, and forms of semi-supervised learning.
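The link between the SVD and HGR maximal correlation is concrete enough to compute. Below is a minimal sketch (not code from the paper), assuming finite alphabets: the maximal correlation equals the second singular value of the normalized joint-distribution matrix B with entries P(x,y)/sqrt(P(x)P(y)). The function name and example distribution are illustrative.

```python
import numpy as np

def hgr_maximal_correlation(P_xy):
    """HGR maximal correlation of a finite joint pmf via the SVD.

    P_xy[i, j] = P(X=i, Y=j). The normalized matrix
    B[i, j] = P(i, j) / sqrt(P_X(i) P_Y(j)) always has top singular
    value 1 (constant functions); its second singular value is the
    Hirschfeld-Gebelein-Rényi maximal correlation of X and Y.
    """
    P_x = P_xy.sum(axis=1)
    P_y = P_xy.sum(axis=0)
    B = P_xy / np.sqrt(np.outer(P_x, P_y))
    return np.linalg.svd(B, compute_uv=False)[1]

# Illustrative joint distribution of two correlated binary variables.
P = np.array([[0.30, 0.10],
              [0.05, 0.55]])
print(hgr_maximal_correlation(P))  # strictly between 0 and 1 here
```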
Extreme Classification in Log Memory
We present Merged-Averaged Classifiers via Hashing (MACH) for
K-classification with ultra-large values of K. Compared to traditional
one-vs-all classifiers that require O(Kd) memory and inference cost, MACH only
needs O(d log K) memory (where d is the dimensionality) and only O(K log K + d
log K) operations for inference. MACH is a generic K-classification algorithm,
with provable theoretical guarantees, that requires O(log K) memory without
any assumption on the relationship between classes. MACH uses universal hashing
to reduce classification with a large number of classes to a few independent
classification tasks, each with a small (constant) number of classes. We provide
theoretical quantification of discriminability-memory tradeoff. With MACH we
can train on the ODP dataset, with 100,000 classes and 400,000 features, on a
single Titan X GPU, reaching a classification accuracy of 19.28%, the best
accuracy reported on this dataset. Before this work, the best-performing
baseline was a one-vs-all classifier that requires 40 billion parameters (160 GB
model size) and achieves 9% accuracy. In contrast, MACH can achieve 9% accuracy
with a 480x reduction in model size (a mere 0.3 GB). With MACH, we also
demonstrate complete training on the fine-grained ImageNet dataset (compressed
size 104 GB), with 21,000 classes, on a single GPU. To the best of our
knowledge, this is the first work to demonstrate complete training of these
extreme-class datasets on a single Titan X GPU.
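A hedged sketch of the merged-averaged idea described above: the class `MACHSketch`, the prime-based 2-universal hash, and the use of scikit-learn's LogisticRegression as the small base classifier are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class MACHSketch:
    """Hedged sketch of merged-averaged classifiers via hashing.

    R independent 2-universal hashes map K classes into B << K buckets;
    one small B-class classifier is trained per hash, and a class is
    scored by summing its buckets' probabilities across the R models
    (summing and averaging give the same argmax).
    """

    def __init__(self, n_classes, n_buckets, n_repetitions, seed=0):
        rng = np.random.default_rng(seed)
        self.K, self.B, self.R = n_classes, n_buckets, n_repetitions
        self.p = 2_147_483_647  # prime for (a*k + b) % p % B hashing
        self.a = rng.integers(1, self.p, size=n_repetitions)
        self.b = rng.integers(0, self.p, size=n_repetitions)
        self.models = []

    def _hash(self, labels, r):
        return (self.a[r] * np.asarray(labels) + self.b[r]) % self.p % self.B

    def fit(self, X, y):
        for r in range(self.R):
            clf = LogisticRegression(max_iter=200)
            clf.fit(X, self._hash(y, r))  # train on hashed bucket labels
            self.models.append(clf)
        return self

    def predict(self, X):
        scores = np.zeros((X.shape[0], self.K))
        classes = np.arange(self.K)
        for r, clf in enumerate(self.models):
            proba = clf.predict_proba(X)      # one column per seen bucket
            buckets = self._hash(classes, r)  # each class's bucket id
            col = {c: i for i, c in enumerate(clf.classes_)}
            idx = np.array([col.get(bkt, -1) for bkt in buckets])
            seen = idx >= 0
            scores[:, seen] += proba[:, idx[seen]]
        return scores.argmax(axis=1)

# Usage (illustrative): integer labels in [0, K).
# model = MACHSketch(n_classes=1000, n_buckets=32, n_repetitions=8).fit(X, y)
# preds = model.predict(X_test)
```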
Metric for Automatic Machine Translation Evaluation based on Universal Sentence Representations
Sentence representations can capture a wide range of information that cannot
be captured by local features based on character or word N-grams. This paper
examines the usefulness of universal sentence representations for evaluating
the quality of machine translation. Although it is difficult to train sentence
representations using small-scale translation datasets with manual evaluation,
sentence representations trained from large-scale data in other tasks can
improve the automatic evaluation of machine translation. Experimental results
on the WMT-2016 dataset show that the proposed method achieves state-of-the-art
performance with sentence representation features only.
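As a hedged sketch of how pretrained sentence representations can drive such a metric: embed the MT output and the reference, combine the two vectors, and regress segment-level human judgments. The `encode` function is a hypothetical placeholder for any universal sentence encoder, and the concatenation/product/absolute-difference feature recipe is a common choice that may differ from the paper's.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def pair_features(u, v):
    # Combine the MT-output and reference embeddings: concatenation,
    # element-wise product, and absolute difference (a common recipe;
    # the paper's exact feature set may differ).
    return np.concatenate([u, v, u * v, np.abs(u - v)])

def train_metric(mt_sents, ref_sents, human_scores, encode):
    # `encode` is a hypothetical placeholder for any pretrained universal
    # sentence encoder trained on large-scale data from other tasks.
    X = np.stack([pair_features(encode(m), encode(r))
                  for m, r in zip(mt_sents, ref_sents)])
    reg = MLPRegressor(hidden_layer_sizes=(128,), max_iter=500)
    reg.fit(X, human_scores)  # regress segment-level human judgments
    return reg
```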
Kernel Mean Embedding of Distributions: A Review and Beyond
A Hilbert space embedding of a distribution---in short, a kernel mean
embedding---has recently emerged as a powerful tool for machine learning and
inference. The basic idea behind this framework is to map distributions into a
reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel
methods can be extended to probability measures. It can be viewed as a
generalization of the original "feature map" common to support vector machines
(SVMs) and other kernel methods. While initially closely associated with the
latter, it has meanwhile found application in fields ranging from kernel
machines and probabilistic modeling to statistical inference, causal discovery,
and deep learning. The goal of this survey is to give a comprehensive review of
existing work and recent advances in this research area, and to discuss the
most challenging issues and open problems that could lead to new research
directions. The survey begins with a brief introduction to the RKHS and
positive definite kernels, which form the backbone of this survey, followed by
a thorough discussion of the Hilbert space embedding of marginal distributions,
theoretical guarantees, and a review of its applications. The embedding of
distributions enables us to apply RKHS methods to probability measures which
prompts a wide range of applications such as kernel two-sample testing,
independence testing, and learning on distributional data. Next, we discuss the
Hilbert space embedding for conditional distributions, give theoretical
insights, and review some applications. The conditional mean embedding enables
us to perform sum, product, and Bayes' rules---which are ubiquitous in
graphical models, probabilistic inference, and reinforcement learning---in a
non-parametric way. We then discuss relationships between this framework and
other related areas. Lastly, we give some suggestions on future research
directions.
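One application mentioned above, kernel two-sample testing, makes the embedding idea concrete: the (biased) squared maximum mean discrepancy is simply the squared RKHS distance between two empirical kernel mean embeddings. A minimal sketch with an RBF kernel (kernel choice, bandwidth, and sample sizes are illustrative):

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased squared MMD between samples X and Y: the squared RKHS
    distance between the two empirical kernel mean embeddings."""
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return rbf(X, X).mean() + rbf(Y, Y).mean() - 2 * rbf(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))  # shifted distribution
print(mmd2(X, Y))  # should clearly exceed mmd2 of two same-law samples
```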
A Linear Dynamical System Model for Text
Low dimensional representations of words allow accurate NLP models to be
trained on limited annotated data. While most representations ignore words'
local context, a natural way to induce context-dependent representations is to
perform inference in a probabilistic latent-variable sequence model. Given the
recent success of continuous vector space word representations, we provide such
an inference procedure for continuous states, where words' representations are
given by the posterior mean of a linear dynamical system. Here, efficient
inference can be performed using Kalman filtering. Our learning algorithm is
extremely scalable, operating on simple cooccurrence counts for both parameter
initialization using the method of moments and subsequent iterations of EM. In
our experiments, we employ our inferred word embeddings as features in standard
tagging tasks, obtaining significant accuracy improvements. Finally, the Kalman
filter updates can be seen as a linear recurrent neural network. We demonstrate
that using the parameters of our model to initialize a non-linear recurrent
neural network language model reduces its training time by a day and yields
lower perplexity.
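A minimal sketch of the filtering computation behind such representations, under simplified assumptions (known parameters, dense Gaussian noise; the paper's method-of-moments initialization and EM updates are omitted, and all variable names are illustrative):

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, mu0, P0):
    """Filtered posterior means E[h_t | y_1..y_t] for the model
    h_t = A h_{t-1} + w_t,  y_t = C h_t + v_t,
    w_t ~ N(0, Q), v_t ~ N(0, R). In the text model, y_t would be an
    observed word vector and the posterior mean of h_t the token's
    context-dependent representation."""
    mu, P = mu0, P0
    means = []
    for y in ys:
        mu_pred, P_pred = A @ mu, A @ P @ A.T + Q   # predict
        S = C @ P_pred @ C.T + R                    # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)         # Kalman gain
        mu = mu_pred + K @ (y - C @ mu_pred)        # update mean
        P = (np.eye(len(mu)) - K @ C) @ P_pred      # update covariance
        means.append(mu)
    return np.stack(means)

# Illustrative parameters for a 2-dimensional latent state.
d = 2
A, C, Q, R = 0.9 * np.eye(d), np.eye(d), 0.1 * np.eye(d), 0.1 * np.eye(d)
ys = np.random.default_rng(0).normal(size=(5, d))
print(kalman_filter(ys, A, C, Q, R, np.zeros(d), np.eye(d)))
```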
Deep Learning
Deep learning (DL) is a high-dimensional data reduction technique for
constructing high-dimensional predictors in input-output models. DL is a form
of machine learning that uses hierarchical layers of latent features. In this
article, we review the state-of-the-art of deep learning from a modeling and
algorithmic perspective. We provide a list of successful areas of applications
in Artificial Intelligence (AI), Image Processing, Robotics and Automation.
Deep learning is predictive in its nature rather than inferential and can be
viewed as a black-box methodology for high-dimensional function estimation.
An Information Theoretic Interpretation to Deep Neural Networks
It is commonly believed that the hidden layers of deep neural networks (DNNs)
attempt to extract informative features for learning tasks. In this paper, we
formalize this intuition by showing that the features extracted by DNNs
coincide with the result of an optimization problem, which we call the
'universal feature selection' problem, in a local analysis regime. We interpret
the weight training in a DNN as the projection of feature functions between feature
spaces, specified by the network structure. Our formulation has direct
operational meaning in terms of the performance for inference tasks, and gives
interpretations to the internal computation results of DNNs. Results of
numerical experiments are provided to support the analysis.
Stein Variational Gradient Descent as Moment Matching
Stein variational gradient descent (SVGD) is a non-parametric inference
algorithm that evolves a set of particles to fit a given distribution of
interest. We analyze the non-asymptotic properties of SVGD, showing that there
exists a set of functions, which we call the Stein matching set, whose
expectations are exactly estimated by any set of particles that satisfies the
fixed point equation of SVGD. This set is the image of the Stein operator applied
on the feature maps of the positive definite kernel used in SVGD. Our results
provide a theoretical framework for analyzing the properties of SVGD with
different kernels, shedding insight into optimal kernel choice. In particular,
we show that SVGD with linear kernels yields exact estimation of means and
variances on Gaussian distributions, while random Fourier features enable
probabilistic bounds for distributional approximation. Our results offer a
refreshing view of the classical inference problem as fitting Stein's identity
or solving the Stein equation, which may motivate more efficient algorithms.
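A minimal sketch of the SVGD update the analysis concerns, using an RBF kernel and a standard Gaussian target (both illustrative choices; the abstract's exact mean/variance matching holds for linear kernels):

```python
import numpy as np

def svgd_step(particles, grad_log_p, h=1.0, step=0.1):
    """One SVGD update with RBF kernel k(x, x') = exp(-||x-x'||^2 / (2h^2)):
    phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j)
                             + grad_{x_j} k(x_j, x_i) ]."""
    n, _ = particles.shape
    diffs = particles[:, None, :] - particles[None, :, :]  # x_i - x_j
    K = np.exp(-(diffs ** 2).sum(-1) / (2 * h ** 2))       # (n, n)
    grads = np.stack([grad_log_p(x) for x in particles])   # (n, d)
    drive = K @ grads / n                                  # kernel-weighted gradients
    repulse = (K[:, :, None] * diffs).sum(axis=1) / (n * h ** 2)  # repulsion
    return particles + step * (drive + repulse)

# Target: standard Gaussian, grad log p(x) = -x.
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2)) + 3.0
for _ in range(500):
    x = svgd_step(x, lambda z: -z)
print(x.mean(axis=0), x.var(axis=0))  # should approach 0 and 1
```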
GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations
Modern deep transfer learning approaches have mainly focused on learning
generic feature vectors from one task that are transferable to other tasks,
such as word embeddings in language and pretrained convolutional features in
vision. However, these approaches usually transfer unary features and largely
ignore more structured graphical representations. This work explores the
possibility of learning generic latent relational graphs that capture
dependencies between pairs of data units (e.g., words or pixels) from
large-scale unlabeled data and transferring the graphs to downstream tasks. Our
proposed transfer learning framework improves performance on various tasks
including question answering, natural language inference, sentiment analysis,
and image classification. We also show that the learned graphs are generic
enough to be transferred to different embeddings on which the graphs have not
been trained (including GloVe embeddings, ELMo embeddings, and task-specific
RNN hidden units), or to embedding-free units such as image pixels.
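As a hedged sketch of the transfer pattern described above (not the paper's architecture): a row-normalized pairwise affinity graph is computed from one set of unit features and then applied to mix a different embedding space. The projections here are random stand-ins for what a graph predictor would learn from large-scale unlabeled data.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relational_graph(H, Wq, Wk):
    # Row-normalized pairwise affinities over units: G[i, j] is how much
    # unit i attends to unit j. Wq, Wk stand in for projections a graph
    # predictor would learn from large-scale unlabeled data.
    scores = (H @ Wq) @ (H @ Wk).T / np.sqrt(Wq.shape[1])
    return softmax(scores)

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 16))    # features the graph is computed from
E = rng.normal(size=(7, 300))   # different embeddings (e.g. GloVe) of the
                                # same 7 units, never seen during training
Wq, Wk = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
G = relational_graph(H, Wq, Wk)
mixed = G @ E                   # graph-contextualized embeddings for a task
```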
Deep Kernel Learning
We introduce scalable deep kernels, which combine the structural properties
of deep learning architectures with the non-parametric flexibility of kernel
methods. Specifically, we transform the inputs of a spectral mixture base
kernel with a deep architecture, using local kernel interpolation, inducing
points, and structure exploiting (Kronecker and Toeplitz) algebra for a
scalable kernel representation. These closed-form kernels can be used as
drop-in replacements for standard kernels, with benefits in expressive power
and scalability. We jointly learn the properties of these kernels through the
marginal likelihood of a Gaussian process. Inference and learning cost O(n)
for n training points, and predictions cost O(1) per test point. On a large
and diverse collection of applications, including a dataset with 2 million
examples, we show improved performance over scalable Gaussian processes with
flexible kernel learning models, and stand-alone deep architectures.
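A minimal sketch of the core construction, k_deep(x, x') = k_base(g(x), g(x')) with a neural warp g, learned through the GP marginal likelihood. An RBF base kernel stands in for the paper's spectral mixture kernel, and the inducing-point/Kronecker machinery that gives the quoted O(n) scalability is omitted; all shapes and parameters are illustrative.

```python
import numpy as np

def mlp_warp(X, W1, b1, W2, b2):
    # The "deep" part: a feedforward warp g(x; w) of the inputs.
    return np.tanh(X @ W1 + b1) @ W2 + b2

def deep_kernel(X1, X2, params, lengthscale=1.0):
    # k_deep(x, x') = k_base(g(x), g(x')); an RBF base kernel stands in
    # for the paper's spectral mixture base kernel.
    Z1, Z2 = mlp_warp(X1, *params), mlp_warp(X2, *params)
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def gp_log_marginal_likelihood(X, y, params, noise=0.1):
    # The objective through which warp and kernel parameters are learned jointly.
    K = deep_kernel(X, X, params) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
params = (rng.normal(size=(3, 10)), np.zeros(10),
          rng.normal(size=(10, 2)), np.zeros(2))
print(gp_log_marginal_likelihood(X, y, params))
```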