Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control
We introduce a framework for calibrating machine learning models so that
their predictions satisfy explicit, finite-sample statistical guarantees. Our
calibration algorithm works with any underlying model and (unknown)
data-generating distribution and does not require model refitting. The
framework addresses, among other examples, false discovery rate control in
multi-label classification, intersection-over-union control in instance
segmentation, and the simultaneous control of the type-1 error of outlier
detection and confidence set coverage in classification or regression. Our main
insight is to reframe the risk-control problem as multiple hypothesis testing,
enabling techniques and mathematical arguments different from those in the
previous literature. We use our framework to provide new calibration methods
for several core machine learning tasks with detailed worked examples in
computer vision and tabular medical data.
Comment: Code available at https://github.com/aangelopoulos/lt
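The reframing of risk control as multiple hypothesis testing can be illustrated with a minimal sketch: for a grid of candidate thresholds (assumed ordered so that risk grows with the index), a Hoeffding p-value tests whether each threshold's risk exceeds the target level, and fixed-sequence testing controls the family-wise error. The function name, the bounded-loss assumption, and the specific choice of Hoeffding's bound with fixed-sequence testing are an illustrative simplification of the framework, not its full generality.

```python
import numpy as np

def ltt_calibrate(losses, alpha=0.1, delta=0.05):
    """Simplified Learn-then-Test style calibration.

    losses : array of shape (n_lambdas, n_cal) with losses in [0, 1];
             row j holds the loss of candidate threshold lambda_j on each
             calibration point, rows ordered from safest to riskiest.
    Returns the indices of thresholds certified to have risk <= alpha
    with probability at least 1 - delta.
    """
    n = losses.shape[1]
    rhat = losses.mean(axis=1)                 # empirical risk per candidate
    # Hoeffding p-value for H_j: "the true risk of lambda_j exceeds alpha"
    pvals = np.exp(-2.0 * n * np.clip(alpha - rhat, 0.0, None) ** 2)
    valid = []
    for j, p in enumerate(pvals):              # fixed-sequence testing:
        if p > delta:                          # stop at the first failure
            break
        valid.append(j)
    return valid
```

Because the pass stops at the first non-rejected hypothesis, every returned index inherits a simultaneous guarantee without any multiplicity correction factor.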
A test for second-order stationarity of time series based on unsystematic sub-samples
In this paper, we introduce a new method for testing the stationarity of time
series, where the test statistic is obtained from measuring and maximising the
difference in the second-order structure over pairs of randomly drawn
intervals. The asymptotic normality of the test statistic is established for
both Gaussian and a range of non-Gaussian time series, and a bootstrap
procedure is proposed for estimating the variance of the main statistics.
Further, we show the consistency of our test under local alternatives. Owing to
the flexibility of the random, unsystematic sub-samples used to construct the
test statistic, the proposed method can identify intervals of significant
departure from stationarity without any dyadic constraints, an advantage over
other tests employing systematic designs. We demonstrate its good finite-sample
performance on both simulated and real data, particularly in detecting
localised departures from stationarity.
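A minimal sketch of the random-interval idea (illustrative only; the paper's statistic, its normalisation, and the bootstrap calibration are more involved) draws many interval pairs, compares local autocovariances up to a small lag, and keeps the largest standardized discrepancy together with the interval pair that produced it. All parameter names and defaults here are assumptions for the demonstration.

```python
import numpy as np

def stationarity_scan(x, n_pairs=200, seg_len=64, lag_max=5, seed=0):
    """Compare second-order structure over randomly drawn interval pairs.

    Returns the largest scaled autocovariance discrepancy found and the
    pair of intervals (start, end) that attains it; a large value flags a
    localized departure from second-order stationarity.
    """
    rng = np.random.default_rng(seed)
    n = len(x)

    def local_acov(seg):
        seg = seg - seg.mean()
        m = len(seg)
        return np.array([seg[:m - h] @ seg[h:] / m for h in range(lag_max + 1)])

    best, best_pair = -np.inf, None
    for _ in range(n_pairs):
        a, b = rng.integers(0, n - seg_len + 1, size=2)
        d = np.abs(local_acov(x[a:a + seg_len]) - local_acov(x[b:b + seg_len]))
        stat = np.sqrt(seg_len) * d.max()      # scaled max discrepancy
        if stat > best:
            best = stat
            best_pair = ((int(a), int(a + seg_len)), (int(b), int(b + seg_len)))
    return best, best_pair
```

On stationary noise the maximal discrepancy stays moderate, while a variance break inflates it sharply once one interval of a pair lands on each side of the break, which is how the random design localizes the departure without dyadic constraints.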
Heterogeneous Change Point Inference
We propose HSMUCE (heterogeneous simultaneous multiscale change-point
estimator) for the detection of multiple change-points of the signal in a
heterogeneous Gaussian regression model. A piecewise constant function is
estimated by minimizing the number of change-points over the acceptance region
of a multiscale test which locally adapts to changes in the variance. The
multiscale test is a combination of local likelihood ratio tests which are
properly calibrated by scale dependent critical values in order to keep a
global nominal level alpha, even for finite samples. We show that HSMUCE
controls the error of over- and underestimation of the number of change-points.
To this end, new deviation bounds for F-type statistics are derived. Moreover,
we obtain confidence sets for the whole signal. All results are non-asymptotic
and uniform over a large class of heterogeneous change-point models. HSMUCE is
fast to compute, achieves the optimal detection rate and estimates the number
of change-points at almost optimal accuracy for vanishing signals, while still
being robust. We compare HSMUCE with several state-of-the-art methods in
simulations and analyse current recordings of a transmembrane protein in the
bacterial outer membrane, which shows pronounced heterogeneity across its
states. An R package is available online.
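The flavour of a variance-adaptive local test can be sketched with a Welch-type scan for a single change-point. This toy keeps only the heterogeneity-robust ingredient: HSMUCE itself combines many such local likelihood ratio tests across scales with scale-dependent critical values, which this sketch omits.

```python
import numpy as np

def welch_scan(x, min_seg=5):
    """Scan candidate change-points with a Welch-type statistic.

    Like HSMUCE's local tests, the statistic standardizes the mean
    difference by variances estimated separately on the two sides, so a
    variance change alone does not masquerade as a mean change.
    Returns the best split index and its score.
    """
    n = len(x)
    best_k, best_t = None, -np.inf
    for k in range(min_seg, n - min_seg + 1):
        left, right = x[:k], x[k:]
        se = np.sqrt(left.var(ddof=1) / len(left) +
                     right.var(ddof=1) / len(right))
        t = abs(left.mean() - right.mean()) / se
        if t > best_t:
            best_k, best_t = k, t
    return best_k, best_t
```

In the full method such local scores are only accepted when they clear scale-dependent critical values calibrated to a global level alpha, and the estimate minimizes the number of change-points over that acceptance region.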
Novel methods for anomaly detection
Anomaly detection is of increasing importance in the data-rich world of today. It can be applied to a broad range of challenges, ranging from fault detection to fraud prevention and cyber-security. Many of these applications require algorithms which are very scalable, as well as accurate, due to large data volumes and/or limited computational resources. This thesis contributes three novel approaches to the field of anomaly detection. The first contribution, Collective And Point Anomalies (CAPA), detects and distinguishes between both collective and point anomalies in linear time. The second contribution, MultiVariate Collective And Point Anomalies (MVCAPA), extends CAPA to the multivariate setting. The third contribution is a novel particle-based Kalman filter which detects and distinguishes between additive outliers and innovative outliers.
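The collective-versus-point distinction can be illustrated with a toy detector. This is not the CAPA algorithm itself, which optimises a penalised cost with a dynamic programme; it is a hypothetical stand-in that robustly z-scores the series, treats isolated extremes as point anomalies, and treats windows whose mean drifts from the typical level as collective anomalies. All thresholds are illustrative assumptions.

```python
import numpy as np

def classify_anomalies(x, point_thresh=5.0, window=10, seg_thresh=2.0):
    """Separate point anomalies from collective anomalies (toy sketch).

    Returns (point_indices, segments) where each segment is a half-open
    (start, end) index pair marking a collective anomaly.
    """
    mu = np.median(x)
    sd = 1.4826 * np.median(np.abs(x - mu))     # robust scale (MAD)
    z = (x - mu) / sd
    point_idx = np.flatnonzero(np.abs(z) > point_thresh)
    z_seg = z.copy()
    z_seg[point_idx] = 0.0                      # mask single extremes first
    segments, i = [], 0
    while i + window <= len(x):
        if abs(z_seg[i:i + window].mean()) > seg_thresh:
            j = i + window                       # greedily extend the segment
            while j < len(x) and abs(z_seg[i:j + 1].mean()) > seg_thresh:
                j += 1
            segments.append((i, j))
            i = j
        else:
            i += 1
    in_seg = set()
    for s, e in segments:
        in_seg.update(range(s, e))
    points = sorted(set(point_idx.tolist()) - in_seg)
    return points, segments
```

Masking the isolated extremes before the window scan is what keeps a single spike from being misread as a short collective anomaly; CAPA achieves the analogous separation through different penalty terms for the two anomaly types.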
Closed-Form Bayesian Inferences for the Logit Model via Polynomial Expansions
Articles in Marketing and choice literatures have demonstrated the need for
incorporating person-level heterogeneity into behavioral models (e.g., logit
models for multiple binary outcomes as studied here). However, the logit
likelihood extended with a population distribution of heterogeneity does not
yield closed-form inferences, so numerical integration techniques (e.g., MCMC
methods) are relied upon instead.
We present here an alternative: closed-form Bayesian inferences for the logit
model, which we obtain by approximating the logit likelihood via a polynomial
expansion, and then positing a distribution of heterogeneity from a flexible
family that is now conjugate and integrable. For problems where the response
coefficients are independent, choosing the Gamma distribution leads to rapidly
convergent closed-form expansions; if there are correlations among the
coefficients one can still obtain rapidly convergent closed-form expansions by
positing a distribution of heterogeneity from a Multivariate Gamma
distribution. The solution then comes from the moment generating function of
the Multivariate Gamma distribution or in general from the multivariate
heterogeneity distribution assumed.
Closed-form Bayesian inferences, derivatives (useful for elasticity
calculations), population distribution parameter estimates (useful for
summarization) and starting values (useful for complicated algorithms) are
hence directly available. Two simulation studies demonstrate the efficacy of
our approach.
Comment: 30 pages, 2 figures, corrected some typos. Appears in Quantitative
Marketing and Economics vol 4 (2006), no. 2, 173--20
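The mechanics can be sketched for the simplest case, P(y=1) = E[1/(1+e^{-b})] with b ~ Gamma(shape, rate): expanding the logistic function as a series and integrating term by term against the Gamma moment generating function, E[e^{-kb}] = (1 + k/rate)^{-shape}, yields a closed-form alternating series. Note this toy uses a geometric rather than the paper's polynomial expansion, and Gamma(2, 1) in the usage below is an arbitrary illustration.

```python
import math

def marginal_logit_prob(shape, rate, terms=200):
    """P(y=1) = E[1/(1+e^{-b})], b ~ Gamma(shape, rate), via series.

    Uses 1/(1+e^{-b}) = sum_k (-1)^k e^{-k b} (valid termwise for b > 0)
    and the Gamma MGF, so each term is available in closed form; the last
    two partial sums are averaged to damp the alternating oscillation.
    """
    s, prev = 0.0, 0.0
    for k in range(terms):
        s += (-1) ** k * (1.0 + k / rate) ** (-shape)
        if k == terms - 2:
            prev = s
    return 0.5 * (prev + s)
```

For shape = 2 and rate = 1 the terms reduce to (-1)^k / (k+1)^2, so the series sums exactly to pi^2/12, which gives a convenient check that no numerical integration was needed.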
Rank-statistics based enrichment-site prediction algorithm developed for chromatin immunoprecipitation on chip experiments
Background: High-density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. Tiling arrays are increasingly used in chromatin immunoprecipitation (IP) experiments (ChIP on chip). ChIP on chip facilitates the generation of genome-wide maps of in vivo interactions between DNA-associated proteins, including transcription factors, and DNA. Analysis of the hybridization of an immunoprecipitated sample to a tiling array facilitates the identification of ChIP-enriched segments of the genome. These enriched segments are putative targets of antibody-assayable regulatory elements. The enrichment response is not ubiquitous across the genome: typically 5 to 10% of tiled probes manifest some significant enrichment and, depending upon the factor being studied, this response can drop to less than 1%. The detection and assessment of significance for interactions that emanate from non-canonical and/or un-annotated regions of the genome is especially challenging, which is the motivation behind the proposed algorithm.
Results: We have proposed a novel rank- and replicate-statistics-based methodology for identifying and ascribing statistical confidence to regions of ChIP enrichment. The algorithm is optimized for the identification of sites that manifest low levels of enrichment but are true positives, as validated by alternative biochemical experiments. Although the method is described here in the context of ChIP on chip experiments, it can be generalized to any treatment-control experimental design. The results of the algorithm show a high degree of concordance with independent biochemical validation methods, and its sensitivity and specificity have been characterized via quantitative PCR and independent computational approaches.
Conclusion: The algorithm ranks all enrichment sites based on their intra-replicate ranks and inter-replicate rank consistency. Following the ranking, the method allows segmentation of sites based on a meta p-value, a composite array-signal-enrichment criterion, or a composite of these two measures. The sensitivities obtained after segmenting the data using a meta p-value of 10^-5, an array signal enrichment of 0.2, and a composite of these two values are 88%, 87% and 95%, respectively.
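The ranking-and-combination step can be sketched as follows (an illustrative stand-in, not the paper's exact procedure): rank probes within each replicate, treat each normalized rank as an approximate null p-value, and combine across replicates Fisher-style into a meta p-value, using the closed-form Erlang survival function so no statistics library is required.

```python
import math
import numpy as np

def enrichment_meta_pvalues(reps):
    """Meta p-values from intra-replicate ranks (sketch).

    reps : array (n_reps, n_probes) of enrichment signal per replicate.
    A probe that ranks near the top in every replicate gets a small
    combined p-value; rank-inconsistent probes do not.
    """
    n_reps, n_probes = reps.shape
    ranks = np.empty((n_reps, n_probes), dtype=float)
    for r in range(n_reps):
        order = np.argsort(-reps[r])           # strongest probe first
        ranks[r, order] = np.arange(1, n_probes + 1)
    u = ranks / n_probes                       # normalized rank in (0, 1]
    x = -np.log(u).sum(axis=0)                 # Fisher-style combination
    # Survival function of Gamma(n_reps, 1) at x, closed form (Erlang):
    pvals = np.array([math.exp(-xi) * sum(xi ** j / math.factorial(j)
                                          for j in range(n_reps))
                      for xi in x])
    return pvals
```

Thresholding these meta p-values (e.g., at 10^-5, as in the reported segmentation) would then carve out the candidate enrichment sites; the published algorithm additionally folds in an array-signal-enrichment criterion.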