2,059 research outputs found
Neyman-pearson classiffication under high-dimensional settings
Most existing binary classification methods target on the optimization of the overall classification risk and may fail to serve some real-world applications such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error and a constrained type I error under a user specified level. This article is the first attempt to construct classifiers with guaranteed theoretical performance under the NP paradigm in high-dimensional settings. Based on the fundamental Neyman-Pearson Lemma, we used a plug-in approach to construct NP-Type classifiers for Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are natural NP paradigm counterparts of the oracle inequalities in classical binary classification. Besides their desirable theoretical properties, we also demonstrated their numerical advantages in prioritized error control via both simulation and real data studies
On the power of conditional independence testing under model-X
For testing conditional independence (CI) of a response Y and a predictor X
given covariates Z, the recently introduced model-X (MX) framework has been the
subject of active methodological research, especially in the context of MX
knockoffs and their successful application to genome-wide association studies.
In this paper, we study the power of MX CI tests, yielding quantitative
explanations for empirically observed phenomena and novel insights to guide the
design of MX methodology. We show that any valid MX CI test must also be valid
conditionally on Y and Z; this conditioning allows us to reformulate the
problem as testing a point null hypothesis involving the conditional
distribution of X. The Neyman-Pearson lemma then implies that the conditional
randomization test (CRT) based on a likelihood statistic is the most powerful
MX CI test against a point alternative. We also obtain a related optimality
result for MX knockoffs. Switching to an asymptotic framework with arbitrarily
growing covariate dimension, we derive an expression for the limiting power of
the CRT against local semiparametric alternatives in terms of the prediction
error of the machine learning algorithm on which its test statistic is based.
Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error
control under the assumption that only the first two moments of X given Z are
known, a significant relaxation of the MX assumption
Hierarchical Randomized Smoothing
Real-world data is complex and often consists of objects that can be
decomposed into multiple entities (e.g. images into pixels, graphs into
interconnected nodes). Randomized smoothing is a powerful framework for making
models provably robust against small changes to their inputs - by guaranteeing
robustness of the majority vote when randomly adding noise before
classification. Yet, certifying robustness on such complex data via randomized
smoothing is challenging when adversaries do not arbitrarily perturb entire
objects (e.g. images) but only a subset of their entities (e.g. pixels). As a
solution, we introduce hierarchical randomized smoothing: We partially smooth
objects by adding random noise only on a randomly selected subset of their
entities. By adding noise in a more targeted manner than existing methods we
obtain stronger robustness guarantees while maintaining high accuracy. We
initialize hierarchical smoothing using different noising distributions,
yielding novel robustness certificates for discrete and continuous domains. We
experimentally demonstrate the importance of hierarchical smoothing in image
and node classification, where it yields superior robustness-accuracy
trade-offs. Overall, hierarchical smoothing is an important contribution
towards models that are both - certifiably robust to perturbations and
accurate
Benchmarking optimality of time series classification methods in distinguishing diffusions
Statistical optimality benchmarking is crucial for analyzing and designing time series classification (TSC) algorithms. This study proposes to benchmark the optimality of TSC algorithms in distinguishing diffusion processes by the likelihood ratio test (LRT). The LRT is an optimal classifier by the Neyman-Pearson lemma. The LRT benchmarks are computationally efficient because the LRT does not need training, and the diffusion processes can be efficiently simulated and are flexible to reflect the specific features of real-world applications. We demonstrate the benchmarking with three widely-used TSC algorithms: random forest, ResNet, and ROCKET. These algorithms can achieve the LRT optimality for univariate time series and multivariate Gaussian processes. However, these model-agnostic algorithms are suboptimal in classifying high-dimensional nonlinear multivariate time series. Additionally, the LRT benchmark provides tools to analyze the dependence of classification accuracy on the time length, dimension, temporal sampling frequency, and randomness of the time series
Distribution-Free Rates in Neyman-Pearson Classification
We consider the problem of Neyman-Pearson classification which models
unbalanced classification settings where error w.r.t. a distribution is
to be minimized subject to low error w.r.t. a different distribution .
Given a fixed VC class of classifiers to be minimized over, we
provide a full characterization of possible distribution-free rates, i.e.,
minimax rates over the space of all pairs . The rates involve a
dichotomy between hard and easy classes as characterized by a
simple geometric condition, a three-points-separation condition, loosely
related to VC dimension
- …