Active Learning with Multiple Views
Active learners alleviate the burden of labeling large amounts of data by
detecting and asking the user to label only the most informative examples in
the domain. We focus here on active learning for multi-view domains, in which
there are several disjoint subsets of features (views), each of which is
sufficient to learn the target concept. In this paper we make several
contributions. First, we introduce Co-Testing, which is the first approach to
multi-view active learning. Second, we extend the multi-view learning framework
by also exploiting weak views, which are adequate only for learning a concept
that is more general/specific than the target concept. Finally, we empirically
show that Co-Testing outperforms existing active learners on a variety of
real-world domains such as wrapper induction, Web page classification,
advertisement removal, and discourse tree parsing.
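To make the multi-view idea concrete, here is a minimal sketch of a query loop in the spirit of Co-Testing: one hypothesis is learned per view, and the learner asks for labels only on unlabeled examples where the two views disagree. The 1-D threshold learner, the random choice among disagreeing examples, the toy data, and all function names are assumptions made for illustration; they are not taken from the abstract.

```python
import random

def train_threshold(values, labels):
    """Fit a 1-D rule 'predict 1 when value >= t', with t picked among observed values."""
    best_t, best_err = None, None
    for t in sorted(set(values)):
        err = sum((v >= t) != bool(y) for v, y in zip(values, labels))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def fit_views(labeled):
    """Learn one threshold per view from the same labeled examples."""
    ys = [y for _, y in labeled]
    t1 = train_threshold([x[0] for x, _ in labeled], ys)
    t2 = train_threshold([x[1] for x, _ in labeled], ys)
    return t1, t2

def co_test(labeled, pool, oracle, max_queries, seed=0):
    """Query only examples on which the two view hypotheses disagree."""
    rng = random.Random(seed)
    labeled, pool = list(labeled), list(pool)
    for _ in range(max_queries):
        t1, t2 = fit_views(labeled)
        contention = [x for x in pool if (x[0] >= t1) != (x[1] >= t2)]
        if not contention:
            break  # the views agree everywhere on the pool
        query = rng.choice(contention)  # assumed selection rule: uniform at random
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return fit_views(labeled), labeled

# Toy domain (hypothetical): each view alone separates the classes.
examples = [(a, 10 * (4 - a)) if a < 5 else (a, 10 * (14 - a)) for a in range(10)]
oracle = lambda x: int(x[0] >= 5)
seed_set = [((0, 40), 0), ((9, 50), 1)]
pool = [x for x in examples if x not in [s for s, _ in seed_set]]
thresholds, labeled_out = co_test(seed_set, pool, oracle, max_queries=10)
```

On this toy data the loop stops as soon as the views agree on every pooled example, which takes at most four queries out of eight unlabeled examples.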
Bootstrapping for penalized spline regression.
We describe and contrast several different bootstrapping procedures for penalized spline smoothers. The bootstrapping procedures considered are variations on existing methods, developed under two different probabilistic frameworks. Under the first framework, penalized spline regression is considered an estimation technique to find an unknown smooth function. The smooth function is represented in a high-dimensional spline basis, with spline coefficients estimated in a penalized form. Under the second framework, the unknown function is treated as a realization of a set of random spline coefficients, which are then predicted in a linear mixed model. We describe how bootstrapping methods can be implemented under both frameworks, and we show in theory and through simulations and examples that bootstrapping provides valid inference in both cases. We compare the inference obtained under both frameworks, and conclude that the latter generally produces better results than the former. The bootstrapping ideas are extended to hypothesis testing, where parametric components in a model are tested against nonparametric alternatives.
Keywords: Methods; Framework; Regression; Linear mixed model; Mixed model; Model; Theory; Simulation; Hypothesis testing
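As a rough illustration of the first framework (the smooth function is a fixed unknown and the spline coefficients are penalized estimates), the sketch below fits a penalized spline and builds pointwise percentile bands with a generic residual bootstrap. The truncated-linear basis, the ridge-type penalty on the knot coefficients, the fixed smoothing parameter `lam`, and every function name are assumptions for illustration; the specific procedures studied in the paper may differ.

```python
import random

def basis(x, knots):
    """Truncated-linear spline basis: 1, x, (x - k)_+ for each knot."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]

def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit(xs, ys, knots, lam):
    """Penalized least squares; only the knot coefficients are penalized."""
    X = [basis(x, knots) for x in xs]
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    for j in range(2, p):
        A[j][j] += lam
    rhs = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(p)]
    beta = solve(A, rhs)
    return [sum(bi * ci for bi, ci in zip(row, beta)) for row in X]

def bootstrap_band(xs, ys, knots, lam, n_boot=200, level=0.95, seed=0):
    """Residual bootstrap: resample centred residuals, refit, take pointwise percentiles."""
    rng = random.Random(seed)
    fhat = fit(xs, ys, knots, lam)
    resid = [y - f for y, f in zip(ys, fhat)]
    centre = sum(resid) / len(resid)
    resid = [r - centre for r in resid]
    curves = []
    for _ in range(n_boot):
        y_star = [f + rng.choice(resid) for f in fhat]
        curves.append(fit(xs, y_star, knots, lam))
    lo_idx = int((1 - level) / 2 * (n_boot - 1))
    hi_idx = n_boot - 1 - lo_idx
    lower, upper = [], []
    for i in range(len(xs)):
        col = sorted(c[i] for c in curves)
        lower.append(col[lo_idx])
        upper.append(col[hi_idx])
    return fhat, lower, upper
```

Under the second (mixed-model) framework one would instead treat the knot coefficients as random effects and resample them as well as the errors; that variant is not sketched here.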
Exploiting the Statistics of Learning and Inference
When dealing with datasets containing a billion instances or with simulations
that require a supercomputer to execute, computational resources become part of
the equation. We can improve the efficiency of learning and inference by
exploiting their inherent statistical nature. We propose algorithms that
exploit the redundancy of data relative to a model by subsampling data-cases
for every update and reasoning about the uncertainty created in this process.
In the context of learning we propose to test for the probability that a
stochastically estimated gradient points more than 180 degrees in the wrong
direction. In the context of MCMC sampling we use stochastic gradients to
improve the efficiency of MCMC updates, and hypothesis tests based on adaptive
mini-batches to decide whether to accept or reject a proposed parameter update.
Finally, we argue that in the context of likelihood-free MCMC one needs to
store all the information revealed by all simulations, for instance in a
Gaussian process. We conclude that Bayesian methods will continue to play a
crucial role in the era of big data and big simulations, but only if we
overcome a number of computational challenges.
Comment: Proceedings of the NIPS workshop on "Probabilistic Models for Big Data"
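The accept/reject test on adaptive mini-batches mentioned in this abstract can be sketched as follows. The idea shown is to compare the mean per-datum log-likelihood difference between the proposed and current parameters against a threshold `mu0`, growing the mini-batch until the comparison is statistically clear. The normal approximation, the finite-population correction, the batch schedule, the precomputed list of differences, and all names are assumptions for illustration; `mu0` stands for whatever threshold the usual Metropolis-Hastings rule implies once the uniform draw, prior ratio, and proposal ratio are folded in.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def approx_mh_accept(loglik_diffs, mu0, batch=50, eps=0.05, rng=None):
    """Decide whether mean(loglik_diffs) > mu0 using as few data-cases as possible.

    loglik_diffs[i] is assumed to be log p(x_i | proposed) - log p(x_i | current).
    A real sampler would compute these lazily; they are precomputed here for brevity.
    Returns (accept, number_of_data_cases_used). Assumes batch >= 2.
    """
    rng = rng or random.Random(0)
    total = len(loglik_diffs)
    order = list(range(total))
    rng.shuffle(order)  # sample without replacement
    n = 0
    while True:
        n = min(total, n + batch)
        sample = [loglik_diffs[i] for i in order[:n]]
        lbar = mean(sample)
        if n == total:
            return lbar > mu0, n  # all data seen: the decision is exact
        # Standard error of the mean with a finite-population correction.
        se = stdev(sample) / math.sqrt(n) * math.sqrt(1 - (n - 1) / (total - 1))
        if se == 0:
            continue  # no spread observed yet; enlarge the batch
        z = abs(lbar - mu0) / se
        delta = 1 - NormalDist().cdf(z)  # approximate chance of a wrong decision
        if delta < eps:
            return lbar > mu0, n
```

When the proposal is clearly better or clearly worse, the test stops after the first mini-batch; borderline proposals consume more data, up to the full dataset.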