RAPTT: An Exact Two-Sample Test in High Dimensions Using Random Projections
In high dimensions, the classical Hotelling's test tends to have low
power or becomes undefined due to singularity of the sample covariance matrix.
In this paper, this problem is overcome by projecting the data matrix onto
lower dimensional subspaces through multiplication by random matrices. We
propose RAPTT (RAndom Projection T-Test), an exact test for equality of means
of two normal populations based on projected lower dimensional data. RAPTT does
not require any constraints on the dimension of the data or the sample size. A
simulation study indicates that in high dimensions the power of this test is
often greater than that of competing tests. The advantage of RAPTT is
illustrated on high-dimensional gene expression data involving the
discrimination of tumor and normal colon tissues.
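The idea behind RAPTT can be sketched in a few lines: project the p-dimensional samples to a low dimension k with a random matrix, then apply the classical Hotelling T² statistic, which is well defined once k is small relative to the sample sizes. This is an illustrative sketch of the projection idea only, not the authors' exact test (which combines many random projections into an exact p-value); all names here are our own.

```python
import numpy as np

def hotelling_t2(x, y):
    """Classical two-sample Hotelling T^2 statistic."""
    n1, n2 = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    # Pooled sample covariance; singular when dimension >= n1 + n2 - 2.
    s = ((n1 - 1) * np.cov(x, rowvar=False) +
         (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    return n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(s, diff)

def projected_t2(x, y, k, rng):
    """T^2 after projecting onto a random k-dimensional subspace."""
    p = x.shape[1]
    r = rng.standard_normal((p, k))  # random projection matrix
    return hotelling_t2(x @ r, y @ r)

rng = np.random.default_rng(0)
n, p, k = 20, 100, 5                      # p >> n: plain Hotelling is undefined
x = rng.standard_normal((n, p))
y = rng.standard_normal((n, p)) + 0.5     # shifted mean
stat = projected_t2(x, y, k, rng)
```

Repeating the projection with independent random matrices and aggregating the resulting statistics is what yields an exact test in the paper.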
A label-efficient two-sample test
Two-sample tests evaluate whether two samples are realizations of the same
distribution (the null hypothesis) or two different distributions (the
alternative hypothesis). We consider a new setting for this problem where
sample features are easily measured whereas sample labels are unknown and
costly to obtain. Accordingly, we devise a three-stage framework in service of
performing an effective two-sample test with only a small number of sample
label queries: first, a classifier is trained with samples uniformly labeled to
model the posterior probabilities of the labels; second, a novel query scheme
dubbed \emph{bimodal query} is used to query labels of samples from both
classes; and third, the classical Friedman-Rafsky (FR) two-sample test is
performed on the queried samples. Theoretical analysis and extensive
experiments performed on several datasets demonstrate that the proposed test
controls the Type I error and has decreased Type II error relative to uniform
querying and certainty-based querying. Source code for our algorithms and
experimental results is available at
\url{https://github.com/wayne0908/Label-Efficient-Two-Sample}.
Comment: Accepted to the 38th Conference on Uncertainty in Artificial Intelligence (UAI 2022).
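The third stage, the Friedman-Rafsky test, admits a compact sketch: build a minimum spanning tree on the pooled samples and count edges that join points from different samples; few cross-sample edges are evidence against the null. The bimodal query scheme is the paper's contribution and is omitted here; this minimal illustration (our own function names, using SciPy's MST routine) shows only the FR statistic.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def friedman_rafsky_cross_edges(x, y):
    """Count MST edges joining points from different samples.

    Few cross edges suggest the two samples come from different
    distributions; many suggest they are well mixed (the null).
    """
    z = np.vstack([x, y])
    labels = np.r_[np.zeros(len(x)), np.ones(len(y))]
    d = cdist(z, z)                      # pairwise Euclidean distances
    mst = minimum_spanning_tree(d).tocoo()
    return int(np.sum(labels[mst.row] != labels[mst.col]))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.1, size=(15, 2))
y = rng.normal(10.0, 0.1, size=(15, 2))   # well separated from x
cross = friedman_rafsky_cross_edges(x, y)  # tight, distant clusters: one bridge edge
```

In practice the statistic is compared against its permutation or asymptotic null distribution to obtain a p-value.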
Learning Fair Scoring Functions: Bipartite Ranking under ROC-based Fairness Constraints
Many applications of AI involve scoring individuals using a learned function
of their attributes. These predictive risk scores are then used to take
decisions based on whether the score exceeds a certain threshold, which may
vary depending on the context. The level of delegation granted to such systems
in critical applications like credit lending and medical diagnosis will heavily
depend on how questions of fairness can be answered. In this paper, we study
fairness for the problem of learning scoring functions from binary labeled
data, a classic learning task known as bipartite ranking. We argue that the
functional nature of the ROC curve, the gold standard measure of ranking
accuracy in this context, leads to several ways of formulating fairness
constraints. We introduce general families of fairness definitions based on the
AUC and on ROC curves, and show that our ROC-based constraints can be
instantiated such that classifiers obtained by thresholding the scoring
function satisfy classification fairness for a desired range of thresholds. We
establish generalization bounds for scoring functions learned under such
constraints, design practical learning algorithms, and show the relevance of our
approach with numerical experiments on real and synthetic data.
Comment: 35 pages, 13 figures, 6 tables.
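One simple member of the AUC-based fairness families discussed above can be illustrated as follows: require the score's ranking accuracy (AUC) to be similar across two protected groups. The function names and the gap criterion below are our own simplification; the paper defines considerably more general AUC- and ROC-based constraint families.

```python
import numpy as np

def auc(scores, labels):
    """Empirical AUC: fraction of positive/negative pairs ranked correctly."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Ties count as half-correct.
    correct = ((pos[:, None] > neg[None, :]).sum()
               + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return correct / (len(pos) * len(neg))

def auc_fairness_gap(scores, labels, group):
    """Absolute difference of within-group AUCs (small gap = AUC parity)."""
    return abs(auc(scores[group == 0], labels[group == 0]) -
               auc(scores[group == 1], labels[group == 1]))

# A scorer that ranks perfectly within both groups has zero gap.
scores = np.array([0., 1., 0., 1., 0., 1., 0., 1.])
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gap = auc_fairness_gap(scores, labels, group)
```

The ROC-based constraints in the paper go further: by constraining the ROC curves pointwise rather than their scalar summaries, thresholded classifiers inherit classification fairness over a range of thresholds.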
Selective review of offline change point detection methods
This article presents a selective survey of algorithms for the offline
detection of multiple change points in multivariate time series. A general yet
structuring methodological strategy is adopted to organize this vast body of
work. More precisely, detection algorithms considered in this review are
characterized by three elements: a cost function, a search method and a
constraint on the number of changes. Each of those elements is described,
reviewed and discussed separately. Implementations of the main algorithms
described in this article are provided within a Python package called ruptures.
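The cost/search/constraint decomposition can be shown in miniature with the simplest possible instance of each element: an L2 cost, exhaustive search, and a constraint of exactly one change point. This toy sketch is our own; ruptures implements the general versions surveyed above.

```python
import numpy as np

def l2_cost(segment):
    """Cost function: sum of squared deviations from the segment mean."""
    return ((segment - segment.mean(axis=0)) ** 2).sum()

def single_changepoint(signal, min_size=2):
    """Search method: exhaustive scan for the split index t minimizing
    the total cost of [0, t) + [t, n), under the constraint of exactly
    one change point."""
    n = len(signal)
    costs = [l2_cost(signal[:t]) + l2_cost(signal[t:])
             for t in range(min_size, n - min_size + 1)]
    return min_size + int(np.argmin(costs))

# Piecewise-constant signal with a mean shift at index 50.
rng = np.random.default_rng(0)
sig = np.r_[np.zeros(50), np.ones(50)] + 0.01 * rng.standard_normal(100)
t_hat = single_changepoint(sig)   # recovers the change near index 50
```

Swapping the cost (e.g. kernel-based), the search (e.g. PELT, binary segmentation), or the constraint (penalized rather than fixed number of changes) yields the algorithm families organized by the review.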
Bipartite Ranking: a Risk-Theoretic Perspective
We present a systematic study of the bipartite ranking problem, with the aim of explicating its connections to the class-probability estimation problem. Our study focuses on the properties of the statistical risk for bipartite ranking with general losses, which is closely related to a generalised notion of the area under the ROC
curve: we establish alternate representations of this risk, relate the Bayes-optimal risk to a class of probability divergences, and characterise the set of Bayes-optimal scorers for the risk. We further study properties of a generalised class of bipartite risks, based on the p-norm push of Rudin (2009). Our analysis is based on the rich framework of proper losses, which are the central tool in the study of class-probability estimation. We show
how this analytic tool makes transparent the generalisations of several existing results, such as the equivalence of the minimisers for four seemingly disparate risks from bipartite ranking and class-probability estimation. A novel practical implication of our analysis is the design of new families of losses for scenarios where accuracy at the head of a ranked list is paramount, with empirical performance comparable to the p-norm push.
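The p-norm push of Rudin (2009) referenced above can be sketched empirically: each negative example's pairwise losses against all positives are aggregated with an L_p norm, so a large p concentrates the penalty on the highest-scoring negatives, i.e. on errors at the head of the ranked list. The exponential surrogate and function names below are our own illustrative choices.

```python
import numpy as np

def pnorm_push_risk(pos_scores, neg_scores, p=4.0):
    """Empirical p-norm push risk with an exponential pairwise surrogate."""
    # Loss for each (positive, negative) pair: large when the negative
    # is scored above the positive.
    pair_loss = np.exp(neg_scores[None, :] - pos_scores[:, None])
    per_negative = pair_loss.sum(axis=0)   # total loss caused by each negative
    # L_p aggregation: large p is dominated by the worst-ranked negatives.
    return (per_negative ** p).mean() ** (1.0 / p)

pos = np.array([2.0, 2.0])
neg_clean = np.array([-2.0, -2.0])
neg_outlier = np.array([-2.0, 3.0])    # one negative pushed to the head of the list
r_clean = pnorm_push_risk(pos, neg_clean)
r_outlier = pnorm_push_risk(pos, neg_outlier)
```

With p = 1 this reduces to the usual pairwise (AUC-style) surrogate risk; increasing p interpolates toward penalizing only the single worst negative, which is the head-of-list emphasis the abstract refers to.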