Encrypted statistical machine learning: new privacy preserving methods
We present two new statistical machine learning methods designed to learn on
fully homomorphic encrypted (FHE) data. The introduction of FHE schemes
following Gentry (2009) opens up the prospect of privacy preserving statistical
machine learning analysis and modelling of encrypted data without compromising
security constraints. We propose tailored algorithms for applying extremely
random forests, involving a new cryptographic stochastic fraction estimator,
and naïve Bayes, involving a semi-parametric model for the class decision
boundary, and show how they can be used to learn and predict from encrypted
data. We demonstrate that these techniques perform competitively on a variety
of classification data sets and provide detailed information about the
computational practicalities of these and other FHE methods.
Comment: 39 page
kalis: a modern implementation of the Li & Stephens model for local ancestry inference in R
Background: Approximating the recent phylogeny of N phased haplotypes at a set of variants along the genome is a core problem in modern population genomics and central to performing genome-wide screens for association, selection, introgression, and other signals. The Li & Stephens (LS) model provides a simple yet powerful hidden Markov model for inferring the recent ancestry at a given variant, represented as an N×N distance matrix based on posterior decodings. Results: We provide a high-performance engine that makes these posterior decodings readily accessible with minimal pre-processing via an easy-to-use package, kalis, in the statistical programming language R. kalis enables investigators to rapidly resolve the ancestry at loci of interest and developers to build a range of variant-specific ancestral inference pipelines on top. kalis exploits both multi-core parallelism and modern CPU vector instruction sets to enable scaling to hundreds of thousands of genomes. Conclusions: The resulting distance matrices accessible via kalis enable local ancestry, selection, and association studies in modern large-scale genomic datasets.
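As a rough illustration of what a posterior decoding under a Li & Stephens style copying model involves, the following is a minimal pure-Python sketch. It is not the kalis implementation or its API: the function name `ls_posterior`, the single recombination probability `rho`, and the single mismatch probability `mu` are all assumptions made for this example, and no rescaling is applied, so it only suits short toy sequences.

```python
def ls_posterior(target, panel, rho, mu, locus):
    """Posterior copying probabilities at `locus` for a toy copying HMM.

    target: list of 0/1 alleles for the haplotype being explained.
    panel:  list of reference haplotypes (each a list of 0/1 alleles).
    rho:    assumed per-step probability of switching the copied haplotype.
    mu:     assumed probability that the copy mismatches the target allele.
    """
    n_haps, n_sites = len(panel), len(target)

    def emit(j, site):
        # Emission probability: match with 1 - mu, mismatch with mu.
        return (1.0 - mu) if panel[j][site] == target[site] else mu

    # Forward pass: f[j] = P(observations up to site, copying haplotype j).
    f = [emit(j, 0) / n_haps for j in range(n_haps)]
    forward = [f]
    for site in range(1, n_sites):
        total = sum(f)
        f = [emit(j, site) * ((1.0 - rho) * f[j] + rho * total / n_haps)
             for j in range(n_haps)]
        forward.append(f)

    # Backward pass from the last site down to `locus`.
    b = [1.0] * n_haps
    for site in range(n_sites - 1, locus, -1):
        w = [emit(k, site) * b[k] for k in range(n_haps)]
        total = sum(w)
        b = [(1.0 - rho) * w[j] + rho * total / n_haps for j in range(n_haps)]

    # Combine and normalise to obtain the posterior at `locus`.
    post = [forward[locus][j] * b[j] for j in range(n_haps)]
    z = sum(post)
    return [p / z for p in post]


# Toy usage: the target matches the first reference haplotype everywhere.
posterior = ls_posterior([0, 0, 0], [[0, 0, 0], [1, 1, 1]], 0.1, 0.01, 1)
```

Running one such decoding per haplotype against the remaining haplotypes yields one row of posterior probabilities per haplotype, which is the kind of N×N structure the abstract describes; how kalis turns these into distances is not specified here.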
Encrypted accelerated least squares regression
Information that is stored in an encrypted format is, by definition, usually not amenable to statistical analysis or machine learning methods. In this paper we present a detailed analysis of coordinate and accelerated gradient descent algorithms which are capable of fitting least squares and penalised ridge regression models, using data encrypted under a fully homomorphic encryption scheme. Gradient descent is shown to dominate in terms of encrypted computational speed, and theoretical results are proven to give parameter bounds which ensure correctness of decryption. The characteristics of encrypted computation are empirically shown to favour a non-standard acceleration technique. This demonstrates the possibility of approximating conventional statistical regression methods using encrypted data without compromising privacy.
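For orientation, a plain (unencrypted, unaccelerated) gradient descent fit of a ridge regression model can be sketched as below. This is not the paper's encrypted algorithm: the function name `ridge_gradient_descent`, the fixed step size, and the fixed iteration count are assumptions for illustration. The point of the sketch is that each update uses only additions and multiplications, which are the operations homomorphic schemes generally support.

```python
def ridge_gradient_descent(X, y, lam, step, iters):
    """Minimise ||y - X b||^2 + lam * ||b||^2 by plain gradient descent.

    X is a list of rows, y a list of responses; lam is the ridge penalty.
    `step` and `iters` are hypothetical tuning values chosen by the caller.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Residuals r = y - X beta.
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                 for i in range(n)]
        # Gradient of the objective: -2 X^T r + 2 lam beta.
        grad = [-2.0 * sum(X[i][j] * resid[i] for i in range(n))
                + 2.0 * lam * beta[j]
                for j in range(p)]
        # Fixed-step update; no division or comparison is needed.
        beta = [beta[j] - step * grad[j] for j in range(p)]
    return beta


# Toy usage on plaintext data (lam = 0 gives ordinary least squares).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta_ols = ridge_gradient_descent(X, y, 0.0, 0.1, 500)
beta_ridge = ridge_gradient_descent(X, y, 1.0, 0.1, 500)
```

The step size must be small enough for the iteration to converge; here 0.1 works because the largest eigenvalue of the Hessian for this toy design is 6 (or 8 with the penalty).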
Survival signature-based sensitivity analysis of systems with epistemic uncertainties
The survival signature provides a basis for efficient reliability assessment of systems with more than one component type. Often a perfect probabilistic model of the system is not possible because of limited information, vagueness and imprecision, so generalized probabilistic methods need to be used. These methods allow the uncertainties to be modelled explicitly without the need for unjustified hypotheses and approximations. In this paper, a novel and efficient sensitivity approach is presented. The proposed approach is based on the survival signature and allows the components in a system to be identified and ranked. A numerical example is used to illustrate the above methods.
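To make the role of the survival signature concrete, the sketch below evaluates system reliability at a fixed time from a given signature. It assumes precise component reliabilities, independence between types, and exchangeable components within a type; it therefore does not cover the imprecise-probability or sensitivity parts of the paper. The function name `system_survival` and the dictionary representation of the signature are assumptions for this example.

```python
from itertools import product
from math import comb


def system_survival(signature, counts, reliabilities):
    """Probability that the system works, given a survival signature.

    signature:     dict mapping (l_1, ..., l_K) to the probability that the
                   system works when exactly l_k components of type k work.
    counts:        number of components m_k of each type.
    reliabilities: probability p_k that one component of type k works
                   at the time point of interest.
    """
    total = 0.0
    for ls in product(*(range(m + 1) for m in counts)):
        phi = signature.get(ls, 0.0)
        if phi == 0.0:
            continue
        # Binomial probability of exactly l_k working components per type.
        weight = 1.0
        for l, m, p in zip(ls, counts, reliabilities):
            weight *= comb(m, l) * p ** l * (1.0 - p) ** (m - l)
        total += phi * weight
    return total


# Toy system: two type-1 components in parallel, in series with one type-2.
signature = {(1, 1): 1.0, (2, 1): 1.0}
reliability = system_survival(signature, (2, 1), (0.9, 0.8))
```

Because the signature separates system structure from component behaviour, re-evaluating the sum with perturbed `reliabilities` for one type is a simple way to see how strongly that type influences the system.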
Model updating after interventions paradoxically introduces bias
Machine learning is increasingly being used to generate prediction models for
use in a number of real-world settings, from credit risk assessment to clinical
decision support. Recent discussions have highlighted potential problems in the
updating of a predictive score for a binary outcome when an existing predictive
score forms part of the standard workflow, driving interventions. In this
setting, the existing score induces an additional causative pathway which leads
to miscalibration when the original score is replaced. We propose a general
causal framework to describe and address this problem, and demonstrate an
equivalent formulation as a partially observed Markov decision process. We use
this model to demonstrate the impact of such 'naive updating' when performed
repeatedly. Namely, we show that successive predictive scores may converge to a
point where they predict their own effect, or may eventually tend toward a
stable oscillation between two values, and we argue that neither outcome is
desirable. Furthermore, we demonstrate that even if model-fitting procedures
improve, actual performance may worsen. We complement these findings with a
discussion of several potential routes to overcome these issues.
Comment: Sections of this preprint on 'Successive adjuvancy' (section 4,
theorem 2, figures 4,5, and associated discussions) were not included in the
originally submitted version of this paper due to length. This material does
not appear in the published version of this manuscript, and the reader should
be aware that these sections did not undergo peer review.