
    Classification Tree Pruning Under Covariate Shift

    We consider the problem of \emph{pruning} a classification tree, that is, selecting a suitable subtree that balances bias and variance, in common situations with inhomogeneous training data. Namely, assuming access to mostly data from a distribution $P_{X,Y}$, but little data from a desired distribution $Q_{X,Y}$ with different $X$-marginals, we present the first efficient procedure for optimal pruning in such situations, when cross-validation and other penalized variants are grossly inadequate. Optimality is derived with respect to a notion of \emph{average discrepancy} $P_X \to Q_X$ (averaged over $X$ space) which significantly relaxes a recent notion -- termed \emph{transfer-exponent} -- shown to tightly capture the limits of classification under such a distribution shift. Our relaxed notion can be viewed as a measure of \emph{relative dimension} between distributions, as it relates to existing notions of information such as the Minkowski and Rényi dimensions. (38 pages, 8 figures)
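    To make the setting concrete, here is a minimal sketch (not the paper's optimal procedure): grow a large tree on the plentiful source sample from $P$, then choose among its cost-complexity pruned subtrees using the scarce target sample from $Q$, rather than cross-validating on $P$. All data and names here (X_p, y_p, X_q, y_q) are synthetic placeholders.

```python
# Hedged sketch: prune on abundant source data from P, select the subtree
# on the few target points from Q. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def label(X):
    # noiseless Bayes rule shared by P and Q (only the X-marginals shift)
    return (X[:, 0] + X[:, 1] > 0).astype(int)

# Source sample from P (abundant) and target sample from Q (scarce),
# with shifted X-marginals but the same conditional of Y given X.
X_p = rng.normal(0.0, 1.0, size=(2000, 2))
X_q = rng.normal(1.0, 1.0, size=(50, 2))
y_p, y_q = label(X_p), label(X_q)

# Grow a full tree on P and compute its cost-complexity pruning path.
full = DecisionTreeClassifier(random_state=0).fit(X_p, y_p)
alphas = full.cost_complexity_pruning_path(X_p, y_p).ccp_alphas

# Cross-validating on P is misleading under covariate shift, so score
# each pruned subtree on the small Q sample and keep the best one.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_p, y_p)
     for a in alphas),
    key=lambda t: t.score(X_q, y_q),
)
print("chosen leaves:", best.get_n_leaves(),
      "Q-accuracy:", best.score(X_q, y_q))
```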

    Risk Bounds for Embedded Variable Selection in Classification Trees

    The problems of model and variable selection for classification trees are considered jointly. A penalized criterion is proposed which explicitly takes into account the number of variables, and a risk bound inequality is provided for the tree classifier minimizing this criterion. This penalized criterion is compared to the one used during the pruning step of the CART algorithm. It is shown that the two criteria are similar under some specific margin assumptions. In practice, the tuning parameter of the CART penalty has to be calibrated by hold-out. Simulation studies are performed which confirm that the hold-out procedure mimics the form of the proposed penalized criterion.
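    As an illustration only, a CART-style penalized risk that additionally charges for the number of distinct split variables might look like the sketch below; the penalty form and the weights lam_leaves and lam_vars are assumptions for exposition, not the paper's criterion or its calibrated constants.

```python
# Hedged sketch: empirical risk plus penalties on tree size (as in CART
# pruning) and on the number of variables the tree actually splits on.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def penalized_risk(tree, X, y, lam_leaves=0.01, lam_vars=0.02):
    """Misclassification risk + size penalty + variable-count penalty.
    The two weights are illustrative tuning constants."""
    emp_risk = 1.0 - tree.score(X, y)
    feats = tree.tree_.feature                  # split feature per node
    n_vars = np.unique(feats[feats >= 0]).size  # leaf nodes are coded -2
    return emp_risk + lam_leaves * tree.get_n_leaves() + lam_vars * n_vars

# Usage: fit subtrees along the cost-complexity pruning path and keep
# the minimizer of the penalized criterion.
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
alphas = full.cost_complexity_pruning_path(X, y).ccp_alphas
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
         for a in alphas]
best = min(trees, key=lambda t: penalized_risk(t, X, y))
print("leaves:", best.get_n_leaves())
```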

    An Analogue-Digital Model of Computation: Turing Machines with Physical Oracles

    We introduce an abstract analogue-digital model of computation that couples Turing machines to oracles that are physical processes. Since any oracle has the potential to boost the computational power of a Turing machine, the effect of adding a physical process raises interesting questions. Do physical processes add significantly to the power of Turing machines? Can they break the Turing Barrier? Does the power of the Turing machine vary with different physical processes? Specifically, here we take a physical oracle to be a physical experiment, controlled by the Turing machine, that measures some physical quantity. There are three protocols of communication between the Turing machine and the oracle that simulate the types of error propagation common to analogue-digital devices, namely: infinite precision, unbounded precision, and fixed precision. These three types of precision introduce three variants of the physical oracle model. Fixing one archetypal experiment, we show how to classify the computational power of the three models by establishing lower and upper bounds. Using new techniques and ideas about timing, we give a complete classification.
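    A toy simulation can contrast the three precision regimes. Everything here is an assumption made for illustration (the unknown quantity Y, the threshold-style query, and the uniform noise model), not the paper's formal protocols.

```python
# Toy sketch: the oracle holds an unknown physical quantity Y in (0, 1);
# the machine asks whether a dyadic rational z lies below Y, under each
# of the three error-propagation protocols.
import random

Y = 0.6180339887  # the unknown quantity being measured (illustrative)

def infinite_precision(z):
    # the experiment answers exactly, with no error
    return z < Y

def unbounded_precision(z, k):
    # error can be driven below 2**-k, at a time cost growing with k
    return z < Y + random.uniform(-2.0**-k, 2.0**-k)

def fixed_precision(z, eps=1e-3):
    # a hard error bound eps that no amount of extra time reduces
    return z < Y + random.uniform(-eps, eps)

# Bisection extracts bits of Y; under fixed precision the answers become
# unreliable once the interval shrinks to roughly eps, so it stalls there.
lo, hi = 0.0, 1.0
for _ in range(20):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if fixed_precision(mid) else (lo, mid)
print(f"Y ~ {(lo + hi) / 2:.4f} (accuracy limited by eps)")
```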

    A Neyman–Pearson Approach to Statistical Learning

    • …