
    Neyman-Pearson classification under high-dimensional settings

    Most existing binary classification methods target the optimization of the overall classification risk and may fail to serve some real-world applications, such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. The Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error and a type I error constrained under a user-specified level. This article is the first attempt to construct classifiers with guaranteed theoretical performance under the NP paradigm in high-dimensional settings. Based on the fundamental Neyman-Pearson lemma, we use a plug-in approach to construct NP-type classifiers for Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are natural NP-paradigm counterparts of the oracle inequalities in classical binary classification. Besides their desirable theoretical properties, we also demonstrate their numerical advantages in prioritized error control via both simulation and real data studies.
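    A minimal sketch of the plug-in construction described above, assuming Gaussian Naive Bayes density estimates and a simple empirical-quantile threshold; the function name `np_plugin_classifier` and the split/quantile rule are illustrative, and the paper's actual threshold relies on a more careful order-statistic choice to obtain its high-probability type I error guarantee.

```python
import numpy as np
from scipy.stats import norm

def np_plugin_classifier(X0, X1, alpha=0.05):
    """Sketch of a Neyman-Pearson plug-in classifier.

    X0: samples from class 0, whose misclassification rate (the type I
        error) must stay below alpha.  X1: samples from class 1.
    Fits a Gaussian naive Bayes density to each class, then thresholds
    the density ratio so that the empirical type I error on held-out
    class-0 data is at most alpha.
    """
    # Split class 0: one half fits the density, the other sets the threshold.
    n0 = len(X0)
    fit0, hold0 = X0[: n0 // 2], X0[n0 // 2 :]

    mu0, sd0 = fit0.mean(axis=0), fit0.std(axis=0, ddof=1)
    mu1, sd1 = X1.mean(axis=0), X1.std(axis=0, ddof=1)

    def log_ratio(x):
        # Naive Bayes: coordinates treated as independent given the class.
        return (norm.logpdf(x, mu1, sd1).sum(axis=-1)
                - norm.logpdf(x, mu0, sd0).sum(axis=-1))

    # Empirical (1 - alpha) quantile: about an alpha fraction of held-out
    # class-0 points exceed the threshold, i.e. get misclassified as class 1.
    t = np.quantile(log_ratio(hold0), 1 - alpha)
    return lambda x: (log_ratio(x) > t).astype(int)
```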

    On the power of conditional independence testing under model-X

    For testing conditional independence (CI) of a response Y and a predictor X given covariates Z, the recently introduced model-X (MX) framework has been the subject of active methodological research, especially in the context of MX knockoffs and their successful application to genome-wide association studies. In this paper, we study the power of MX CI tests, yielding quantitative explanations for empirically observed phenomena and novel insights to guide the design of MX methodology. We show that any valid MX CI test must also be valid conditionally on Y and Z; this conditioning allows us to reformulate the problem as testing a point null hypothesis involving the conditional distribution of X. The Neyman-Pearson lemma then implies that the conditional randomization test (CRT) based on a likelihood statistic is the most powerful MX CI test against a point alternative. We also obtain a related optimality result for MX knockoffs. Switching to an asymptotic framework with arbitrarily growing covariate dimension, we derive an expression for the limiting power of the CRT against local semiparametric alternatives in terms of the prediction error of the machine learning algorithm on which its test statistic is based. Finally, we exhibit a resampling-free test with uniform asymptotic type I error control under the assumption that only the first two moments of X given Z are known, a significant relaxation of the MX assumption.
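    A schematic of the CRT under the model-X assumption, with a Gaussian working model for X given Z and a crude residual-correlation statistic standing in for the likelihood statistic that the paper shows is optimal; the toy data, the statistic, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def crt_pvalue(x, y, z, sample_x_given_z, statistic, B=500):
    """Sketch of the model-X conditional randomization test (CRT).

    sample_x_given_z(z): draws a fresh X from its (assumed known)
        conditional law given Z -- the model-X assumption.
    statistic(x, y, z): larger values indicate dependence of Y on X given Z.
    Returns a finite-sample-valid p-value based on the rank of the
    observed statistic among the resampled ones.
    """
    t_obs = statistic(x, y, z)
    t_null = np.array([statistic(sample_x_given_z(z), y, z) for _ in range(B)])
    return (1 + np.sum(t_null >= t_obs)) / (B + 1)

# Toy usage: X | Z is Gaussian with known mean z @ beta and unit variance.
n, p = 200, 5
z = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
x = z @ beta + rng.standard_normal(n)
y = 0.5 * x + z @ rng.standard_normal(p) + rng.standard_normal(n)

def stat(x, y, z):
    # Crude statistic: correlation of x with the residual of y regressed on z.
    r = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    return abs(x @ r)

sample = lambda z_: z_ @ beta + rng.standard_normal(len(z_))
print(crt_pvalue(x, y, z, sample, stat))
```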

    Hierarchical Randomized Smoothing

    Real-world data is complex and often consists of objects that can be decomposed into multiple entities (e.g. images into pixels, graphs into interconnected nodes). Randomized smoothing is a powerful framework for making models provably robust against small changes to their inputs, by guaranteeing robustness of the majority vote when noise is randomly added before classification. Yet, certifying robustness on such complex data via randomized smoothing is challenging when adversaries do not arbitrarily perturb entire objects (e.g. images) but only a subset of their entities (e.g. pixels). As a solution, we introduce hierarchical randomized smoothing: we partially smooth objects by adding random noise only on a randomly selected subset of their entities. By adding noise in a more targeted manner than existing methods, we obtain stronger robustness guarantees while maintaining high accuracy. We instantiate hierarchical smoothing using different noising distributions, yielding novel robustness certificates for discrete and continuous domains. We experimentally demonstrate the importance of hierarchical smoothing in image and node classification, where it yields superior robustness-accuracy trade-offs. Overall, hierarchical smoothing is an important contribution towards models that are both certifiably robust to perturbations and accurate.
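    A sketch of the sampling side of hierarchical smoothing, assuming a Gaussian noising distribution on the selected entities; the parameters `p_select`, `sigma`, and `n_samples` are made up for illustration, and the actual robustness certificates additionally require a confidence bound on the majority-class probability, which is omitted here.

```python
import numpy as np

def hierarchical_smooth_predict(x, base_classifier, p_select=0.3,
                                sigma=0.25, n_samples=1000, rng=None):
    """Sketch of a hierarchically smoothed prediction for one object.

    x: array of entities (e.g. the pixels of an image).
    Each round selects a random subset of entities (each kept with
    probability p_select), adds Gaussian noise only on that subset,
    classifies the partially noised object, and finally returns the
    majority vote over all rounds.
    """
    rng = rng or np.random.default_rng()
    votes = {}
    for _ in range(n_samples):
        mask = rng.random(x.shape) < p_select               # entity selection
        noisy = x + mask * rng.normal(0.0, sigma, x.shape)  # partial smoothing
        label = base_classifier(noisy)                      # any hashable label
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```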

    Benchmarking optimality of time series classification methods in distinguishing diffusions

    Statistical optimality benchmarking is crucial for analyzing and designing time series classification (TSC) algorithms. This study proposes to benchmark the optimality of TSC algorithms in distinguishing diffusion processes against the likelihood ratio test (LRT), which is an optimal classifier by the Neyman-Pearson lemma. The LRT benchmarks are computationally efficient because the LRT requires no training, and the diffusion processes can be efficiently simulated and are flexible enough to reflect the specific features of real-world applications. We demonstrate the benchmarking with three widely used TSC algorithms: random forest, ResNet, and ROCKET. These algorithms can achieve the LRT optimality for univariate time series and multivariate Gaussian processes. However, these model-agnostic algorithms are suboptimal in classifying high-dimensional nonlinear multivariate time series. Additionally, the LRT benchmark provides tools to analyze the dependence of classification accuracy on the time length, dimension, temporal sampling frequency, and randomness of the time series.
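    A toy instance of the LRT benchmark, assuming the two classes are Ornstein-Uhlenbeck processes differing only in their drift rate; under the Euler-Maruyama discretization each increment is Gaussian, so the exact likelihood ratio is available in closed form. All parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.01, 100                    # Euler step size and number of steps
theta0, theta1 = 0.5, 1.0            # drift rates of the two OU classes

def simulate_ou(theta, n_paths):
    """Euler-Maruyama paths of dX_t = -theta * X_t dt + dW_t, X_0 = 0."""
    x = np.zeros((n_paths, T + 1))
    for t in range(T):
        x[:, t + 1] = (x[:, t] - theta * x[:, t] * dt
                       + np.sqrt(dt) * rng.standard_normal(n_paths))
    return x

def log_lr(paths):
    """Log-likelihood ratio (class 1 vs class 0) under the Euler scheme,
    where each increment is N(-theta * x * dt, dt); constants cancel."""
    x, dx = paths[:, :-1], np.diff(paths, axis=1)
    ll = lambda th: -((dx + th * x * dt) ** 2 / (2 * dt)).sum(axis=1)
    return ll(theta1) - ll(theta0)

# LRT benchmark accuracy: label a path as class 1 iff its log LR is positive.
paths0, paths1 = simulate_ou(theta0, 5000), simulate_ou(theta1, 5000)
acc = 0.5 * ((log_lr(paths0) <= 0).mean() + (log_lr(paths1) > 0).mean())
print(f"LRT benchmark accuracy: {acc:.3f}")
```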

    Distribution-Free Rates in Neyman-Pearson Classification

    Full text link
    We consider the problem of Neyman-Pearson classification, which models unbalanced classification settings where error w.r.t. a distribution μ1 is to be minimized subject to low error w.r.t. a different distribution μ0. Given a fixed VC class H of classifiers to be minimized over, we provide a full characterization of possible distribution-free rates, i.e., minimax rates over the space of all pairs (μ0, μ1). The rates involve a dichotomy between hard and easy classes H, as characterized by a simple geometric condition, a three-points-separation condition loosely related to VC dimension.
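    For concreteness, the NP problem whose minimax rates are characterized here can be stated as the constrained minimization below, with α the tolerated μ0-error; this is a standard formulation matching the abstract's notation, not taken verbatim from the paper.

```latex
% Neyman-Pearson classification over a fixed VC class H,
% with alpha the user's tolerance for error under mu_0:
\min_{h \in \mathcal{H}} \; \mu_1\bigl( h(X) \neq 1 \bigr)
\quad \text{subject to} \quad
\mu_0\bigl( h(X) \neq 0 \bigr) \le \alpha
```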