Hard Mixtures of Experts for Large Scale Weakly Supervised Vision
Training convolutional networks (CNNs) that fit on a single GPU with
minibatch stochastic gradient descent has become effective in practice.
However, there is still no effective method for training large CNNs that do
not fit in the memory of a few GPU cards, or for parallelizing CNN training. In
this work we show that a simple hard mixture of experts model can be
efficiently trained to good effect on large scale hashtag (multilabel)
prediction tasks. Mixture of experts models are not new (Jacobs et al. 1991,
Collobert et al. 2003), but in the past, researchers have had to devise
sophisticated methods to deal with data fragmentation. We show empirically that
modern weakly supervised data sets are large enough to support naive
partitioning schemes where each data point is assigned to a single expert.
Because the experts are independent, training them in parallel is easy, and
evaluation is cheap for the size of the model. Furthermore, we show that we can
use a single decoding layer for all the experts, allowing a unified feature
embedding space. We demonstrate that it is feasible (and in fact relatively
painless) to train far larger models than could be practically trained with
standard CNN architectures, and that the extra capacity can be well used on
current datasets.
Comment: Appearing in CVPR 201
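As a toy illustration of the naive partitioning scheme this abstract describes (not the authors' actual large-scale pipeline, which partitions deep feature embeddings), the hypothetical helper below hard-assigns each data point to a single "expert" shard by nearest centroid; because every point lands in exactly one shard, the experts could then be trained independently and in parallel:

```python
import random

def partition_by_centroid(points, k, seed=0):
    """Hard-assign each point to one of k 'expert' shards by nearest centroid.

    Illustrative sketch of hard (non-soft) expert assignment: centroids are
    drawn from the data itself, and each point goes to exactly one shard.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    shards = [[] for _ in range(k)]
    for p in points:
        # squared Euclidean distance from p to each centroid
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        shards[dists.index(min(dists))].append(p)
    return shards

# two well-separated clusters routed to two experts
shards = partition_by_centroid(
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)], k=2)
```

Since the shards are disjoint, each expert sees only its own data; the unified feature space mentioned in the abstract would come from sharing the final decoding layer across experts, which this sketch does not model.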
Training Gaussian Mixture Models at Scale via Coresets
How can we train a statistical mixture model on a massive data set? In this
work we show how to construct coresets for mixtures of Gaussians. A coreset is
a weighted subset of the data, which guarantees that models fitting the coreset
also provide a good fit for the original data set. We show that, perhaps
surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension
and the number of mixture components, while being independent of the data set
size. Hence, one can harness computationally intensive algorithms to compute a
good approximation on a significantly smaller data set. More importantly, such
coresets can be efficiently constructed both in distributed and streaming
settings and do not impose restrictions on the data generating process. Our
results rely on a novel reduction of statistical estimation to problems in
computational geometry and new combinatorial complexity results for mixtures of
Gaussians. Empirical evaluation on several real-world datasets suggests that
our coreset-based approach enables significant reduction in training-time with
negligible approximation error.
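The core idea of a coreset, that a small weighted sample can stand in for the full data set, can be sketched with a crude importance-sampling scheme. The sensitivity proxy below (squared distance from the mean plus a uniform term) and the weight formula are hypothetical illustrations, not the paper's actual construction:

```python
import random

def weighted_coreset(points, m, seed=0):
    """Build a weighted subset of the data by importance sampling (sketch).

    Sampling probability p_i is proportional to a crude 'sensitivity' proxy;
    each sampled point carries weight 1 / (m * p_i), so weighted sums over
    the coreset are unbiased estimates of sums over the full data set.
    """
    n = len(points)
    mean = [sum(xs) / n for xs in zip(*points)]
    sens = [sum((a - b) ** 2 for a, b in zip(p, mean)) + 1.0 / n
            for p in points]
    total = sum(sens)
    probs = [s / total for s in sens]
    rng = random.Random(seed)
    idx = rng.choices(range(n), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in idx]
```

Note that points far from the bulk of the data are sampled more often but given proportionally smaller weights, which is what keeps the weighted estimate unbiased while controlling its variance.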
Blind Source Separation with Compressively Sensed Linear Mixtures
This work studies the problem of simultaneously separating and reconstructing
signals from compressively sensed linear mixtures. We assume that all source
signals share a common sparse representation basis. The approach combines
classical Compressive Sensing (CS) theory with a linear mixing model. It allows
the mixtures to be sampled independently of each other. If samples are acquired
in the time domain, this means that the sensors need not be synchronized. Since
Blind Source Separation (BSS) from a linear mixture is only possible up to
permutation and scaling, factoring out these ambiguities leads to a
minimization problem on the so-called oblique manifold. We develop a geometric
conjugate subgradient method that scales to large systems for solving the
problem. Numerical results demonstrate the promising performance of the
proposed algorithm compared to several state-of-the-art methods.
Comment: 9 pages, 2 figures
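The oblique manifold mentioned above is the set of matrices whose columns all have unit Euclidean norm; factoring out the scaling ambiguity of BSS amounts to constraining the mixing matrix to it. A minimal sketch of the basic retraction such manifold methods use (rescaling each column back to unit norm after a step), not the authors' full conjugate subgradient scheme:

```python
def retract_oblique(X):
    """Map a matrix, given as a list of columns, onto the oblique manifold
    by rescaling every column to unit Euclidean norm (illustrative sketch).
    """
    out = []
    for col in X:
        norm = sum(x * x for x in col) ** 0.5
        out.append([x / norm for x in col])
    return out

# after a gradient step, columns drift off the manifold; retract them back
Y = retract_oblique([[3.0, 4.0], [1.0, 1.0]])
```

A full method would alternate subgradient (or conjugate) steps in the tangent space with this retraction, but the retraction alone shows why scaling ambiguity disappears on this manifold.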
Joint Tensor Factorization and Outlying Slab Suppression with Applications
We consider factoring low-rank tensors in the presence of outlying slabs.
This problem is important in practice, because data collected in many
real-world applications, such as speech, fluorescence, and some social network
data, fit this paradigm. Prior work tackles this problem by iteratively
selecting a fixed number of slabs and fitting, a procedure that may not
converge. We formulate this problem from a group-sparsity-promoting point of
view, and propose an alternating optimization framework to handle the
corresponding group-sparsity-regularized low-rank tensor
factorization problem. The proposed algorithm features a similar per-iteration
complexity as the plain trilinear alternating least squares (TALS) algorithm.
Convergence of the proposed algorithm is also easy to analyze under the
framework of alternating optimization and its variants. In addition,
regularization and constraints can be easily incorporated to make use of
\emph{a priori} information on the latent loading factors. Simulations and real
data experiments on blind speech separation, fluorescence data analysis, and
social network mining are used to showcase the effectiveness of the proposed
algorithm.
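A common way to turn a group-sparsity penalty into an alternating least-squares scheme is iterative reweighting: slabs with large residual norms (likely outliers) are down-weighted in the next fit. The exponent `p` and smoothing constant `eps` below are hypothetical choices for illustration, not the paper's exact settings:

```python
def slab_weights(residual_norms, eps=1e-6, p=0.5):
    """IRLS-style slab weights for a group-sparsity penalty (sketch).

    Each slab k gets weight w_k = (p/2) * (||r_k||^2 + eps)^(p/2 - 1),
    where ||r_k|| is its residual norm. Since p < 2, the exponent is
    negative, so outlying slabs with large residuals get small weights
    and contribute little to the next weighted least-squares update.
    """
    return [(p / 2.0) * (r * r + eps) ** (p / 2.0 - 1.0)
            for r in residual_norms]

# a well-fit slab vs. a grossly outlying one
w = slab_weights([0.1, 10.0])
```

In a full TALS-style loop these weights would rescale each slab's rows before every least-squares update of the loading factors, which is why the per-iteration cost stays close to plain TALS.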
Smoothed Analysis of Tensor Decompositions
Low rank tensor decompositions are a powerful tool for learning generative
models, and uniqueness results give them a significant advantage over matrix
decomposition methods. However, tensors pose significant algorithmic challenges
and tensors analogs of much of the matrix algebra toolkit are unlikely to exist
because of hardness results. Efficient decomposition in the overcomplete case
(where rank exceeds dimension) is particularly challenging. We introduce a
smoothed analysis model for studying these questions and develop an efficient
algorithm for tensor decomposition in the highly overcomplete case (rank
polynomial in the dimension). In this setting, we show that our algorithm is
robust to inverse polynomial error -- a crucial property for applications in
learning since we are only allowed a polynomial number of samples. While
algorithms are known for exact tensor decomposition in some overcomplete
settings, our main contribution is in analyzing their stability in the
framework of smoothed analysis.
Our main technical contribution is to show that tensor products of perturbed
vectors are linearly independent in a robust sense (i.e. the associated matrix
has singular values that are at least an inverse polynomial). This key result
paves the way for applying tensor methods to learning problems in the smoothed
setting. In particular, we use it to obtain results for learning multi-view
models and mixtures of axis-aligned Gaussians where there are many more
"components" than dimensions. The assumption here is that the model is not
adversarially chosen, formalized by a perturbation of model parameters. We
believe this is an appealing way to analyze realistic instances of learning
problems, since this framework allows us to overcome many of the usual
limitations of using tensor methods.
Comment: 32 pages (including appendix)
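The key object in the technical contribution above is the matrix whose columns are tensor products of (perturbed) vectors, i.e. a Khatri-Rao product. A small sketch of the phenomenon, checking full column rank where the paper proves a quantitative smallest-singular-value bound; the rank check is a crude proxy for that bound, and the dimensions are arbitrary illustrative choices:

```python
import random

def khatri_rao(A, B):
    """Column-wise Kronecker product: column j of the result is the
    flattened outer (tensor) product of column j of A and column j of B."""
    cols = len(A[0])
    return [[A[i][j] * B[k][j] for j in range(cols)]
            for i in range(len(A)) for k in range(len(B))]

def rank(M, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(M[i][c]))
        if abs(M[pivot][c]) < tol:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            f = M[i][c] / M[r][c]
            for j in range(c, cols):
                M[i][j] -= f * M[r][j]
        r += 1
    return r

# d = 3 dimensions but R = 6 "components": overcomplete (R > d), yet the
# tensor products of randomly perturbed vectors remain linearly independent.
rng = random.Random(0)
d, R = 3, 6
A = [[rng.gauss(0, 1) for _ in range(R)] for _ in range(d)]
B = [[rng.gauss(0, 1) for _ in range(R)] for _ in range(d)]
K = khatri_rao(A, B)  # (d*d) x R matrix of flattened outer products
```

For generic (perturbed) vectors this matrix has full column rank whenever R ≤ d²; the paper's stronger statement is that its smallest singular value is at least inverse-polynomial, which is what makes the decomposition robust to sample noise.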