On PAC-Bayesian Bounds for Random Forests
Existing guarantees in terms of rigorous upper bounds on the generalization
error for the original random forest algorithm, one of the most frequently used
machine learning methods, are unsatisfying. We discuss and evaluate various
PAC-Bayesian approaches to derive such bounds. The bounds do not require
additional hold-out data, because the out-of-bag samples from the bagging in
the training process can be exploited. A random forest predicts by taking a
majority vote of an ensemble of decision trees. The first approach is to bound
the error of the vote by twice the error of the corresponding Gibbs classifier
(classifying with a single member of the ensemble selected at random). However,
this approach does not take into account the effect of averaging out of errors
of individual classifiers when taking the majority vote. This effect provides a
significant boost in performance when the errors are independent or negatively
correlated, but when the correlations are strong the advantage from taking the
majority vote is small. The second approach, based on PAC-Bayesian C-bounds,
takes dependencies between ensemble members into account, but it requires
estimating correlations between the errors of the individual classifiers. When
the correlations are high or the estimation is poor, the bounds degrade. In our
experiments, we compute generalization bounds for random forests on various
benchmark data sets. Because the individual decision trees already perform
well, their predictions are highly correlated and the C-bounds do not lead to
satisfactory results. For the same reason, the bounds based on the analysis of
Gibbs classifiers are typically superior and often reasonably tight. Bounds
based on a validation set, which come at the cost of a smaller training set,
gave better performance guarantees but worse performance in most experiments.
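The relation between the Gibbs error and the majority-vote error described in this abstract can be illustrated with a small simulation. The sketch below is not the paper's PAC-Bayesian bound and uses no out-of-bag data; it only checks the elementary inequality behind the first approach (the vote can be wrong only if at least half the trees are wrong, so its error is at most twice the Gibbs error) on synthetic errors. The function name `vote_errors` and the parameters `p_err` and `rho` are hypothetical, and the correlation model (a shared error draw with probability `rho`) is an assumption chosen for simplicity.

```python
import random


def vote_errors(n_trees=25, n_samples=2000, p_err=0.2, rho=0.0, seed=0):
    """Simulate per-tree errors and return (gibbs_error, majority_vote_error).

    Assumed toy model: with probability rho all trees share one error draw
    for a sample (fully correlated); otherwise each tree errs independently
    with probability p_err.
    """
    rng = random.Random(seed)
    gibbs_total = 0.0  # running sum of the fraction of wrong trees per sample
    mv_wrong = 0       # number of samples on which the majority vote is wrong
    for _ in range(n_samples):
        if rng.random() < rho:
            shared = rng.random() < p_err
            errs = [shared] * n_trees
        else:
            errs = [rng.random() < p_err for _ in range(n_trees)]
        frac_wrong = sum(errs) / n_trees
        gibbs_total += frac_wrong
        # The vote fails only when at least half of the trees are wrong.
        if frac_wrong >= 0.5:
            mv_wrong += 1
    return gibbs_total / n_samples, mv_wrong / n_samples


if __name__ == "__main__":
    for rho in (0.0, 0.9):
        gibbs, mv = vote_errors(rho=rho)
        print(f"rho={rho}: Gibbs={gibbs:.3f}, vote={mv:.3f}, 2*Gibbs={2 * gibbs:.3f}")
```

In this toy model, independent errors let the vote average most mistakes away, while strongly correlated errors leave the vote error close to the Gibbs error, which is the regime the abstract reports for random forests.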
Approximating Likelihood Ratios with Calibrated Discriminative Classifiers
In many fields of science, generalized likelihood ratio tests are established
tools for statistical inference. At the same time, it has become increasingly
common that a simulator (or generative model) is used to describe complex
processes that tie parameters of an underlying theory and measurement
apparatus to high-dimensional observations.
However, simulators often do not provide a way to evaluate the likelihood
function for a given observation, which motivates a new class of
likelihood-free inference algorithms. In this paper, we show that likelihood
ratios are invariant under a specific class of dimensionality reduction maps.
As a direct consequence, we show that
discriminative classifiers can be used to approximate the generalized
likelihood ratio statistic when only a generative model for the data is
available. This leads to a new machine learning-based approach to
likelihood-free inference that is complementary to Approximate Bayesian
Computation, and which does not require a prior on the model parameters.
Experimental results on artificial problems with known exact likelihoods
illustrate the potential of the proposed method.
Comment: 35 pages, 5 figures
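The idea of turning a classifier's output into a likelihood ratio can be sketched on a toy problem where the exact answer is known. The example below is a minimal illustration, not the paper's method or code: it assumes two 1D Gaussians, N(0, 1) and N(1, 1), drawn in equal numbers, for which the exact ratio is p1(x)/p0(x) = exp(x - 0.5). A logistic regression with output s(x) is fitted by plain gradient descent, and the ratio is estimated as s(x) / (1 - s(x)) = exp(w*x + b). All function names are hypothetical, and no separate calibration step is included because the logistic model is well specified for this toy case.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_1d(xs, ys, lr=0.5, steps=300):
    """Fit s(x) = sigmoid(w*x + b) by full-batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            residual = sigmoid(w * x + b) - y
            grad_w += residual * x
            grad_b += residual
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def fit_ratio_estimator(n_per_class=1000, seed=0):
    """Train on balanced samples from N(0,1) (label 0) and N(1,1) (label 1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    xs += [rng.gauss(1.0, 1.0) for _ in range(n_per_class)]
    ys = [0] * n_per_class + [1] * n_per_class
    w, b = fit_logistic_1d(xs, ys)
    # With balanced classes, s / (1 - s) = exp(w*x + b) estimates p1(x) / p0(x).
    return (lambda x: math.exp(w * x + b)), w, b


if __name__ == "__main__":
    ratio, w, b = fit_ratio_estimator()
    for x in (-1.0, 0.5, 2.0):
        print(f"x={x}: estimated={ratio(x):.3f}, exact={math.exp(x - 0.5):.3f}")
```

The balanced class sizes matter here: with unequal sample counts, s / (1 - s) would be off by the ratio of class priors.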
ModHMM: A Modular Supra-Bayesian Genome Segmentation Method
Genome segmentation methods are powerful tools for obtaining cell type- or tissue-specific genome-wide annotations and are frequently used to discover regulatory elements. However, traditional segmentation methods show low predictive accuracy, and their data-driven annotations have some undesirable properties. As an alternative, we developed ModHMM, a highly modular genome segmentation method. Inspired by the supra-Bayesian approach, it incorporates predictions from a set of classifiers. This makes it possible to compute genome segmentations using state-of-the-art methodology. We demonstrate the method on ENCODE data and show that it outperforms traditional segmentation methods not only in predictive performance but also in qualitative aspects. Therefore, ModHMM is a valuable alternative for studying the epigenetic and regulatory landscape across and within cell types or tissues.
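One generic way to combine per-position classifier predictions with a hidden Markov model is to treat the classifier probabilities as emission weights and decode the most likely state path with the Viterbi algorithm. The sketch below shows only that generic idea; it is not ModHMM's actual model, and the function name `viterbi`, the two-state setup, and all numbers are assumptions made for illustration.

```python
import math


def _log(p):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(class_probs, trans, init):
    """Return the most likely state path.

    class_probs[t][k]: classifier probability for state k at position t,
                       used here as an emission weight (an assumption).
    trans[j][k]:       probability of moving from state j to state k.
    init[k]:           probability of starting in state k.
    """
    n_states = len(init)
    score = [_log(init[k]) + _log(class_probs[0][k]) for k in range(n_states)]
    backpointers = []
    for t in range(1, len(class_probs)):
        new_score, ptr = [], []
        for k in range(n_states):
            best_j = max(range(n_states), key=lambda j: score[j] + _log(trans[j][k]))
            new_score.append(
                score[best_j] + _log(trans[best_j][k]) + _log(class_probs[t][k])
            )
            ptr.append(best_j)
        score = new_score
        backpointers.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical classifier outputs for two states over six positions.
    probs = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55],
             [0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
    sticky = [[0.9, 0.1], [0.1, 0.9]]  # transitions that favour staying put
    print(viterbi(probs, sticky, [0.5, 0.5]))  # [0, 0, 0, 0, 1, 1]
```

Taking the most probable state at each position independently would label the third position as state 1; the transition model smooths that isolated call away, which is the kind of benefit a segmentation model adds on top of raw classifier output.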