
    A U-statistic estimator for the variance of resampling-based error estimators

    We revisit resampling procedures for error estimation in binary classification in terms of U-statistics. In particular, we exploit the fact that the error rate estimator involving all learning-testing splits is a U-statistic, so several standard theorems on properties of U-statistics apply: it has minimal variance among all unbiased estimators and is asymptotically normally distributed. Moreover, there is an unbiased estimator for this minimal variance whenever the total sample size is at least twice the learning set size plus two. In this case, we exhibit such an estimator, which is itself a U-statistic. It again enjoys various optimality properties and yields an asymptotically exact hypothesis test of the equality of error rates when two learning algorithms are compared. Our statements apply to any deterministic learning algorithm under weak non-degeneracy assumptions. In an application to tuning parameter choice in lasso regression on a gene expression data set, the test does not reject the null hypothesis of equal error rates between two different parameter values.
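The all-splits estimator the abstract refers to can be sketched concretely. This is a minimal illustration, not the paper's code: the toy nearest-centroid learner, the function names, and the example data are all assumptions made for the sketch; the point is only that averaging the test error over all C(n, m) learning/testing splits is a U-statistic of a fixed-size kernel.

```python
# Sketch of the all-splits (leave-p-out) error-rate estimator.
# Averaging the test error over every learning set of size m makes the
# estimate a U-statistic, which is what the paper's theory builds on.
from itertools import combinations
import statistics

def nearest_centroid_predict(train, x):
    """Toy deterministic learner: classify x by the nearer class mean."""
    means = {}
    for label in (0, 1):
        pts = [xi for xi, yi in train if yi == label]
        means[label] = statistics.fmean(pts) if pts else float("inf")
    return min((abs(x - means[l]), l) for l in (0, 1))[1]

def u_stat_error(data, m):
    """Average test error over all C(n, m) learning/testing splits."""
    n = len(data)
    total, count = 0.0, 0
    for train_idx in combinations(range(n), m):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in range(n) if i not in set(train_idx)]
        errs = [nearest_centroid_predict(train, x) != y for x, y in test]
        total += sum(errs) / len(errs)
        count += 1
    return total / count

# Illustrative 1-D data: feature value and binary label.
data = [(0.1, 0), (0.3, 0), (0.2, 0), (0.9, 1), (1.1, 1), (0.8, 1)]
err = u_stat_error(data, m=4)
```

For well-separated toy data like this the estimate is zero; the paper's contribution concerns the variance of this estimator, for which an unbiased U-statistic estimator exists once n is at least 2m + 2.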

    New Intervals for the Difference Between Two Independent Binomial Proportions

    In this paper we give an Edgeworth expansion for the studentized difference of two independent binomial proportions. We then propose two new intervals by correcting the skewness in the Edgeworth expansion in a direct and an indirect way. These bias-corrected confidence intervals are easy to compute, and their coverage probabilities converge to the nominal level at a rate of O(n^(-1/2)), where n is the size of the combined samples. Our simulation results suggest that in finite samples the new interval based on the indirect method performs similarly to the two best existing intervals in terms of coverage accuracy and average interval length, and that the new interval based on the direct method has the best average coverage accuracy but can have poor coverage when the two true binomial proportions are close to the boundary points.
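The paper's skewness-corrected intervals are not reproduced here, but the studentized difference they start from is standard. A minimal sketch of the baseline Wald interval for p1 − p2 (the quantity whose Edgeworth expansion the paper corrects); the sample counts are made up for illustration:

```python
# Baseline Wald interval for the difference of two independent binomial
# proportions -- the studentized statistic whose skewness the paper's
# new intervals correct (those corrections are not implemented here).
import math

def wald_interval(x1, n1, x2, n2, z=1.96):
    """Approximate 95% CI for p1 - p2 from successes x_i in n_i trials."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

lo, hi = wald_interval(40, 100, 30, 100)  # estimated difference 0.10
```

The Wald interval is symmetric about the point estimate, which is exactly why it ignores skewness; the paper's direct and indirect corrections shift the endpoints to account for the third-moment term in the expansion.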

    Inference on Optimal Treatment Assignments

    We consider inference on optimal treatment assignments. Our methods allow for inference on the treatment assignment rule that would be optimal given knowledge of the population treatment effect in a general setting. The procedure uses multiple hypothesis testing methods to determine a subset of the population for which assignment to treatment can be determined to be optimal after conditioning on all available information, with a prespecified level of confidence. A Monte Carlo study confirms that the inference procedure has good small sample behavior. We apply the method to study the Mexican conditional cash transfer program Progresa.
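The idea of certifying a subset of the population via a multiple-testing correction can be illustrated with a simple step-down rule. This is a sketch in the spirit of the abstract, not the authors' exact procedure: Holm's method, the subgroup effects, and the standard errors below are all illustrative assumptions.

```python
# Illustrative only: Holm step-down selection of subgroups whose
# estimated treatment effect is significantly positive, controlling the
# familywise error rate at level alpha. This stands in for the paper's
# multiple-testing procedure; the numbers are made up.
import math

def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def holm_select(effects, ses, alpha=0.05):
    """Indices of subgroups with effect significantly > 0 (Holm FWER)."""
    m = len(effects)
    pvals = sorted((normal_sf(e / s), i)
                   for i, (e, s) in enumerate(zip(effects, ses)))
    selected = []
    for k, (p, i) in enumerate(pvals):
        if p <= alpha / (m - k):   # Holm step-down threshold
            selected.append(i)
        else:
            break                  # step-down: stop at first failure
    return sorted(selected)

effects = [0.8, 0.1, 0.5, -0.2]   # estimated treatment effect per subgroup
ses = [0.2, 0.2, 0.2, 0.2]        # standard errors
chosen = holm_select(effects, ses)
```

With a prespecified confidence level, only the subgroups surviving the correction are declared ones where assignment to treatment is optimal; the rest remain undetermined rather than rejected.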

    Computer-intensive statistical methods: saddlepoint approximations with applications in bootstrap and robust inference

    The saddlepoint approximation was introduced into statistics in 1954 by Henry E. Daniels. This basic result on approximating the density function of the sample mean has been generalized to many situations. The accuracy of this approximation is very good, particularly in the tails of the distribution and for small sample sizes, compared with normal or Edgeworth approximation methods. Before applying saddlepoint approximations to the bootstrap, this thesis will focus on saddlepoint approximations for the distribution of quadratic forms in normal variables and for the distribution of the waiting time in the coupon collector's problem. Both developments illustrate the modern art of statistics relying on the computer and embodying both numeric and analytic approximations. Saddlepoint approximations are extremely accurate in both cases. This is underlined in the first development by means of an extensive study and several applications to nonparametric regression, and in the second by several examples, including the exhaustive bootstrap seen from a collector's point of view. The remaining part of this thesis is devoted to the use of saddlepoint approximations in order to replace the computer-intensive bootstrap. The recent massive increases in computer power have led to an upsurge in interest in computer-intensive statistical methods. The bootstrap is the first computer-intensive method to become widely known. It found an immediate place in statistical theory and, more slowly, in practice. The bootstrap seems to be gaining ground as the method of choice in a number of applied fields, where classical approaches are known to be unreliable, and there is sustained interest from theoreticians in its development. But it is known that, for accurate approximations in the tails, the nonparametric bootstrap requires a large number of replicates of the statistic. As this is time-intensive, other methods should be considered.
Saddlepoint methods can provide extremely accurate approximations to resampling distributions. As a first step I develop fast saddlepoint approximations to bootstrap distributions that work in the presence of an outlier, using a saddlepoint mixture approximation. Then I look at robust M-estimates of location like Huber's M-estimate of location and its initially MAD scaled version. One peculiarity of the current literature is that saddlepoint methods are often used to approximate the density or distribution functions of bootstrap estimators, rather than related pivots, whereas it is the latter which are more relevant for inference. Hence the aim of the final part of this thesis is to apply saddlepoint approximations to the construction of studentized confidence intervals based on robust M-estimates. As examples I consider the studentized versions of Huber's M-estimate of location, of its initially MAD scaled version and of Huber's proposal 2. In order to make robust inference about a location parameter there are three types of robustness one would like to achieve: robustness of performance for the estimator of location, robustness of validity and robustness of efficiency for the resulting confidence interval method. Hence in the context of studentized bootstrap confidence intervals I investigate these in more detail in order to give recommendations for practical use, underlined by an extensive simulation study.
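Daniels' 1954 result mentioned above has a compact closed form when the cumulant generating function K is known: the density of the mean of n i.i.d. variables at x is approximately sqrt(n / (2π K''(ŝ))) · exp(n(K(ŝ) − ŝx)), where ŝ solves K'(ŝ) = x. A minimal sketch for Exponential(1) variables, chosen because the saddlepoint equation solves in closed form and the exact density (a gamma) is available for comparison; the code is illustrative, not from the thesis:

```python
# Daniels' saddlepoint approximation to the density of the mean of n
# i.i.d. Exponential(1) variables, compared with the exact gamma density.
import math

def K(t):  return -math.log(1.0 - t)     # CGF of Exponential(1), t < 1
def K2(t): return 1.0 / (1.0 - t) ** 2   # second derivative K''(t)

def saddlepoint_density(x, n):
    """f(x) ~ sqrt(n / (2*pi*K''(s))) * exp(n*(K(s) - s*x)), K'(s) = x."""
    s = 1.0 - 1.0 / x                    # K'(s) = 1/(1-s) = x solved exactly
    return math.sqrt(n / (2 * math.pi * K2(s))) * math.exp(n * (K(s) - s * x))

def exact_density(x, n):
    """The mean of n Exp(1) variables is Gamma(n, scale=1/n)."""
    return n * (n * x) ** (n - 1) * math.exp(-n * x) / math.factorial(n - 1)

approx = saddlepoint_density(1.5, n=5)
exact = exact_density(1.5, n=5)
```

Even at n = 5 the unnormalized approximation is within about 2% of the truth here (the error is essentially the Stirling correction), which is the kind of small-sample tail accuracy that makes saddlepoint methods a candidate replacement for brute-force bootstrap resampling.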

    Small sample sizes: A big data problem in high-dimensional data analysis

    Acknowledgements: The authors are grateful to the Editor, Associate Editor and three anonymous referees for their helpful suggestions, which greatly improved the manuscript. Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: the research is supported by German Science Foundation awards number DFG KO 4680/3-2 and PA 2409/3-2.

    Inference on Optimal Treatment Assignments

    We consider inference on optimal treatment assignments. Our methods are the first to allow for inference on the treatment assignment rule that would be optimal given knowledge of the population treatment effect in a general setting. The procedure uses multiple hypothesis testing methods to determine a subset of the population for which assignment to treatment can be determined to be optimal after conditioning on all available information, with a prespecified level of confidence. A Monte Carlo study confirms that the procedure has good small sample behavior. We apply the method to the Mexican conditional cash transfer program Progresa. We demonstrate how the method can be used to design efficient welfare programs by selecting the right beneficiaries and statistically quantifying how strong the evidence is in favor of treating these selected individuals.

    The combination of statistical tests of significance


    Inference on Optimal Treatment Assignments

    We consider inference on optimal treatment assignments. Our methods allow for inference on the treatment assignment rule that would be optimal given knowledge of the population treatment effect in a general setting. The procedure uses multiple hypothesis testing methods to determine a subset of the population for which assignment to treatment can be determined to be optimal after conditioning on all available information, with a prespecified level of confidence. A Monte Carlo study confirms that the inference procedure has good small sample behavior. We apply the method to study Project STAR and the optimal assignment of small classes based on school and teacher characteristics.