An algorithm for removing sensitive information: application to race-independent recidivism prediction
Predictive modeling is increasingly being employed to assist human
decision-makers. One purported advantage of replacing or augmenting human
judgment with computer models in high-stakes settings (such as sentencing,
hiring, policing, college admissions, and parole decisions) is the perceived
"neutrality" of computers. It is argued that because computer models do not
hold personal prejudice, the predictions they produce will be equally free from
prejudice. There is growing recognition that employing algorithms does not
remove the potential for bias, and can even amplify it if the training data
were generated by a process that is itself biased. In this paper, we provide a
probabilistic notion of algorithmic bias. We propose a method to eliminate bias
from predictive models by removing all information regarding protected
variables from the data on which the models are ultimately trained. Unlike
previous work in this area, our framework is general enough to accommodate data
on any measurement scale. Motivated by models currently in use in the criminal
justice system that inform decisions on pre-trial release and parole, we apply
our proposed method to a dataset on the criminal histories of individuals at
the time of sentencing to produce "race-neutral" predictions of re-arrest. In
the process, we demonstrate that a common approach to creating "race-neutral"
models, omitting race as a covariate, still results in racially disparate
predictions. We then demonstrate that the application of our proposed method to
these data removes racial disparities from predictions with minimal impact on
predictive accuracy.
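The "remove the protected information before training" idea can be illustrated with a simple linear version: regress each covariate on the protected variable and keep only the residuals. This is a hedged numpy sketch, not the paper's procedure (which is probabilistic and handles arbitrary measurement scales); the data and variable names are synthetic:

```python
import numpy as np

def residualize(X, z):
    """Strip the linear component of protected variable z from each
    column of X; a linear stand-in for a more general
    information-removal procedure."""
    Z = np.column_stack([np.ones(len(z)), z])      # intercept + protected variable
    beta, *_ = np.linalg.lstsq(Z, X, rcond=None)   # regress every covariate on z
    return X - Z @ beta                            # residuals are uncorrelated with z

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500).astype(float)     # hypothetical binary protected variable
X = np.column_stack([
    z + rng.standard_normal(500),                  # covariate contaminated by z
    rng.standard_normal(500),                      # covariate unrelated to z
])
X_adj = residualize(X, z)
# each adjusted column now has (numerically) zero correlation with z
```

A downstream model trained on X_adj can no longer exploit linear associations with z, which is the point the abstract makes against merely omitting race as a covariate: dropping the column leaves its trace in the other covariates.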
Copula Modeling of Multivariate Longitudinal Data with Dropout
Joint multivariate longitudinal and time-to-event data are gaining increasing
attention in the biomedical sciences where subjects are followed over time to
monitor the progress of a disease or medical condition. In the insurance
context, claims outcomes may be related to a policyholder's dropout or decision
to lapse a policy. This paper introduces a generalized method of moments
technique to estimate dependence parameters where associations are represented
using copulas. A simulation study demonstrates the viability of the approach.
The paper describes how the joint model provides new information that insurers
can use to better manage their portfolios of risks using illustrative data from
a Spanish insurer.
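As a concrete (hypothetical) illustration of copula-based dependence modeling in numpy: draw two claim-like outcomes with exponential marginals whose association comes from a Gaussian copula, then recover the dependence parameter with a simple moment-style inversion of Spearman's rank correlation. This is not the paper's GMM estimator, which targets multivariate longitudinal data with dropout:

```python
import numpy as np
from math import erf

def norm_cdf(z):
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))

def gaussian_copula_exponentials(n, rho, rng):
    """Two Exp(1) outcomes coupled through a Gaussian copula with correlation rho."""
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    u = norm_cdf(rng.standard_normal((n, 2)) @ L.T)          # copula sample on (0,1)^2
    return -np.log(1.0 - u[:, 0]), -np.log(1.0 - u[:, 1])    # exponential quantile transform

def spearman(a, b):
    """Spearman rank correlation via Pearson correlation of ranks."""
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(4)
y1, y2 = gaussian_copula_exponentials(20000, 0.6, rng)
# moment-style inversion: for a Gaussian copula, rho = 2 sin(pi * rho_S / 6)
rho_hat = 2.0 * np.sin(np.pi * spearman(y1, y2) / 6.0)
```

Because ranks are invariant under the monotone marginal transforms, the copula parameter is recoverable from the ranks alone, regardless of the (here exponential) marginals.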
Penalized linear regression with high-dimensional pairwise screening
In variable selection, most existing screening methods focus on marginal
effects and ignore dependence between covariates. To improve the performance of
selection, we incorporate pairwise effects in covariates for screening and
penalization. We achieve this by studying the asymptotic distribution of the
maximal absolute pairwise sample correlation among independent covariates. The
novelty of the theory is that the convergence is with respect to the
dimensionality p, and is uniform with respect to the sample size n.
Moreover, we obtain an upper bound for the maximal pairwise R squared when
regressing the response onto two different covariates. Based on these extreme
value results, we propose a screening procedure to detect covariates pairs that
are potentially correlated and associated with the response. We further combine
the pairwise screening with Sure Independence Screening and develop a new
regularized variable selection procedure. Numerical studies show that our
method is very competitive in terms of both prediction accuracy and variable
selection accuracy.
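A toy numpy illustration of the screening idea: flag covariate pairs whose absolute sample correlation is extreme and that also show a marginal (SIS-style) association with the response. The thresholds below are illustrative constants, not the asymptotically calibrated cutoffs derived in the paper:

```python
import numpy as np

def pairwise_screen(X, y, tau_pair=0.5, tau_marg=0.2):
    """Keep covariate pairs (j, k) that are strongly correlated with
    each other and marginally correlated with the response."""
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)                          # p x p pairwise correlations
    marg = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
    return [(j, k) for j in range(p) for k in range(j + 1, p)
            if abs(R[j, k]) > tau_pair and max(marg[j], marg[k]) > tau_marg]

rng = np.random.default_rng(1)
n = 300
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.standard_normal(n)    # strongly correlated with x1
x3 = rng.standard_normal(n)                                   # independent noise covariate
y = x1 + x2 + rng.standard_normal(n)
pairs = pairwise_screen(np.column_stack([x1, x2, x3]), y)
# only the correlated, response-associated pair (0, 1) is flagged
```

The extreme-value theory in the abstract is what makes this workable when p is huge: it tells you how large the maximal absolute pairwise correlation gets among genuinely independent covariates, so tau_pair can be set to control false discoveries.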
A review of Gaussian Markov models for conditional independence
Markov models lie at the interface between statistical independence in a
probability distribution and graph separation properties. We review model
selection and estimation in directed and undirected Markov models with Gaussian
parametrization, emphasizing the main similarities and differences. These two
model classes are similar but not equivalent, although they share a common
intersection. We present the existing results from a historical perspective,
taking into account the amount of literature existing from both the artificial
intelligence and statistics research communities, where these models were
originated. We cover classical topics such as maximum likelihood estimation and
model selection via hypothesis testing, but also more modern approaches like
regularization and Bayesian methods. We also discuss how the Markov models
reviewed fit in the rich hierarchy of other, higher-level Markov model classes.
Finally, we close the paper by reviewing relaxations of the Gaussian assumption
and pointing out the main areas of application where these Markov models are
used today.
High-dimensional structure learning of binary pairwise Markov networks: A comparative numerical study
Learning the undirected graph structure of a Markov network from data is a
problem that has received a lot of attention during the last few decades. As a
result of the general applicability of the model class, a myriad of methods
have been developed in parallel in several research fields. Recently, as the
size of the considered systems has increased, the focus of new methods has been
shifted towards the high-dimensional domain. In particular, introduction of the
pseudo-likelihood function has pushed the limits of score-based methods which
were originally based on the likelihood function. At the same time, methods
based on simple pairwise tests have been developed to meet the challenges
arising from increasingly large data sets in computational biology. Apart from
being applicable to high-dimensional problems, methods based on the
pseudo-likelihood and pairwise tests are fundamentally very different. To
compare the accuracy of the different types of methods, an extensive numerical
study is performed on data generated by binary pairwise Markov networks. A
parallelizable Gibbs sampler, based on restricted Boltzmann machines, is
proposed as a tool to efficiently sample from sparse high-dimensional networks.
The results of the study show that pairwise methods can be more accurate than
pseudo-likelihood methods in settings often encountered in high-dimensional
structure learning applications.
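For intuition about the model class being sampled, here is a plain single-site Gibbs sampler for a binary (±1) pairwise Markov network in numpy. It is a sequential baseline only; the sampler proposed in the paper is a parallelizable scheme built on restricted Boltzmann machines:

```python
import numpy as np

def gibbs_ising(J, h, n_samples=2000, burn=200, rng=None):
    """Sample +/-1 configurations from P(x) proportional to
    exp(h.x + sum_{i<j} J_ij x_i x_j) via single-site conditional updates."""
    rng = rng or np.random.default_rng(0)
    p = len(h)
    x = rng.choice([-1.0, 1.0], size=p)
    out = np.empty((n_samples, p))
    for t in range(burn + n_samples):
        for i in range(p):
            field = h[i] + J[i] @ x - J[i, i] * x[i]          # local field from neighbours
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))       # P(x_i = +1 | rest)
            x[i] = 1.0 if rng.random() < p_plus else -1.0
        if t >= burn:
            out[t - burn] = x
    return out

# hypothetical 3-node chain with positive couplings and no external field
J = np.zeros((3, 3))
J[0, 1] = J[1, 0] = J[1, 2] = J[2, 1] = 1.0
S = gibbs_ising(J, np.zeros(3))
# neighbouring spins mostly agree: mean(S[:,0]*S[:,1]) is approximately tanh(1)
```

Data generated this way is exactly what both pseudo-likelihood and pairwise-test structure learners take as input; the comparison in the paper asks which of them recovers the nonzero pattern of J more accurately.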
Structure Learning in Graphical Modeling
A graphical model is a statistical model that is associated to a graph whose
nodes correspond to variables of interest. The edges of the graph reflect
allowed conditional dependencies among the variables. Graphical models admit
computationally convenient factorization properties and have long been a
valuable tool for tractable modeling of multivariate distributions. More
recently, applications such as reconstructing gene regulatory networks from
gene expression data have driven major advances in structure learning, that is,
estimating the graph underlying a model. We review some of these advances and
discuss methods such as the graphical lasso and neighborhood selection for
undirected graphical models (or Markov random fields), and the PC algorithm and
score-based search methods for directed graphical models (or Bayesian
networks). We further review extensions that account for effects of latent
variables and heterogeneous data sources.
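Neighborhood selection, one of the undirected methods reviewed, can be sketched in a few lines of numpy: lasso-regress each variable on all the others and connect j and k whenever either regression keeps the other's coefficient (the "OR" combination rule). The coordinate-descent lasso and the penalty level below are illustrative, not tuned:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso; columns of X are assumed standardized."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]             # partial residual
            rho = X[:, j] @ r / len(y)
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) # soft threshold
    return beta

def neighborhood_selection(X, lam=0.15):
    """Estimate an undirected graph: edge {j, k} is kept if the lasso
    for node j selects k or the lasso for node k selects j (OR rule)."""
    p = X.shape[1]
    Xs = (X - X.mean(0)) / X.std(0)
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        b = lasso_cd(Xs[:, others], Xs[:, j], lam)
        for coef, k in zip(b, others):
            if coef != 0.0:
                adj[j, k] = adj[k, j] = True
    return adj

# synthetic Markov chain x1 -> x2 -> x3: true edges {0,1} and {1,2} only
rng = np.random.default_rng(2)
n = 2000
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)
x3 = 0.8 * x2 + 0.6 * rng.standard_normal(n)
adj = neighborhood_selection(np.column_stack([x1, x2, x3]))
```

Although x1 and x3 are strongly marginally correlated, the lasso regressions condition on x2, so the spurious edge {0, 2} is dropped; this conditioning is what distinguishes graph estimation from marginal correlation screening.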
Kernel-based Tests for Joint Independence
We investigate the problem of testing whether d random variables, which may
or may not be continuous, are jointly (or mutually) independent. Our method
builds on ideas of the two variable Hilbert-Schmidt independence criterion
(HSIC) but allows for an arbitrary number of variables. We embed the
d-dimensional joint distribution and the product of the marginals into a
reproducing kernel Hilbert space and define the d-variable Hilbert-Schmidt
independence criterion (dHSIC) as the squared distance between the embeddings.
In the population case, the value of dHSIC is zero if and only if the
variables are jointly independent, as long as the kernel is characteristic.
Based on an empirical estimate of dHSIC, we define three different
non-parametric hypothesis tests: a permutation test, a bootstrap test and a
test based on a Gamma approximation. We prove that the permutation test
achieves the significance level and that the bootstrap test achieves pointwise
asymptotic significance level as well as pointwise asymptotic consistency
(i.e., it is able to detect any type of fixed dependence in the large sample
limit). The Gamma approximation does not come with these guarantees; however,
it is computationally very fast and for small d, it performs well in
practice. Finally, we apply the test to a problem in causal discovery.
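The dHSIC statistic and its permutation test are short to write down in numpy for one-dimensional variables with a Gaussian (hence characteristic) kernel. The bandwidth and permutation count below are illustrative, and the authors' implementation differs in detail:

```python
import numpy as np

def gram(x, bw=1.0):
    """Gaussian-kernel Gram matrix of a 1-D sample."""
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bw ** 2))

def dhsic(samples, bw=1.0):
    """Squared RKHS distance between the embedded empirical joint
    distribution and the product of the empirical marginals."""
    grams = [gram(x, bw) for x in samples]
    joint = np.mean(np.prod(grams, axis=0))                   # joint-embedding term
    prod = np.prod([K.mean() for K in grams])                 # product-of-marginals term
    cross = np.mean(np.prod([K.mean(axis=1) for K in grams], axis=0))
    return joint + prod - 2 * cross                           # a squared norm, so >= 0

def perm_test(samples, n_perm=100, rng=None):
    """Permute each sample independently to emulate the null of joint independence."""
    rng = rng or np.random.default_rng(0)
    stat = dhsic(samples)
    null = [dhsic([rng.permutation(x) for x in samples]) for _ in range(n_perm)]
    pval = (1 + sum(s >= stat for s in null)) / (n_perm + 1)
    return stat, pval

rng = np.random.default_rng(3)
x = rng.standard_normal(100)
y = x + 0.3 * rng.standard_normal(100)        # strongly dependent on x
z = rng.standard_normal(100)                  # independent of both
stat, pval = perm_test([x, y, z], rng=rng)    # small p-value: joint independence rejected
```

Permuting each variable's sample separately destroys every cross-variable dependence while preserving the marginals, which is exactly the null hypothesis the statistic is compared against.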
Spatial random field models based on Lévy indicator convolutions
Process convolutions yield random fields with flexible marginal distributions
and dependence beyond Gaussianity, but statistical inference is often hampered
by a lack of closed-form marginal distributions, and simulation-based inference
may be prohibitively computer-intensive. We here remedy such issues through a
class of process convolutions based on smoothing a (d+1)-dimensional Lévy
basis with an indicator function kernel to construct a d-dimensional
convolution process. Indicator kernels ensure univariate distributions in the
Lévy basis family, which provides a sound basis for interpretation,
parametric modeling and statistical estimation. We propose a class of isotropic
stationary convolution processes constructed through hypograph indicator sets
defined as the space between the curve (s,H(s)) of a spherical probability
density function H and the plane (s,0). If H is radially decreasing, the
covariance is expressed through the univariate distribution function of H. The
bivariate joint tail behavior in such convolution processes is studied in
detail. Simulation and modeling extensions beyond isotropic stationary spatial
models are discussed, including latent process models. For statistical
inference of parametric models, we develop pairwise likelihood techniques and
illustrate these on spatially indexed weed counts in the Bjertop data set, and
on daily wind speed maxima observed over 30 stations in the Netherlands.
A parallel algorithm for penalized learning of the multivariate exponential family from data of mixed types
Computationally efficient evaluation of penalized estimators of multivariate
exponential family distributions is sought. These distributions encompass,
among others, Markov random fields with variates of mixed type (e.g., binary
and continuous) as a special case of interest. The model parameter is estimated
by maximization of the pseudo-likelihood augmented with a convex penalty. The
estimator is shown to be consistent. With a world of multi-core computers in
mind, a computationally efficient parallel Newton-Raphson algorithm is
presented for numerical evaluation of the estimator alongside conditions for
its convergence. Parallelization comprises the division of the parameter vector
into subvectors that are estimated simultaneously and subsequently aggregated
to form an estimate of the original parameter. This approach may also enable
efficient numerical evaluation of other high-dimensional estimators. The
performance of the proposed estimator and algorithm are evaluated in a
simulation study, and the paper concludes with an illustration of the presented
methodology in the reconstruction of the conditional independence network from
data from an integrative omics study.
A General Framework for Mixed Graphical Models
"Mixed Data" comprising a large number of heterogeneous variables (e.g.
count, binary, continuous, skewed continuous, among other data types) are
prevalent in varied areas such as genomics and proteomics, imaging genetics,
national security, social networking, and Internet advertising. There have been
limited efforts at statistically modeling such mixed data jointly, in part
because of the lack of computationally amenable multivariate distributions that
can capture direct dependencies between such mixed variables of different
types. In this paper, we address this by introducing a novel class of Block
Directed Markov Random Fields (BDMRFs). Using the basic building block of
node-conditional univariate exponential families from Yang et al. (2012), we
introduce a class of mixed conditional random field distributions that are
then chained according to a block-directed acyclic graph to form our class of
Block Directed Markov Random Fields (BDMRFs). The Markov independence graph
structure underlying a BDMRF thus has both directed and undirected edges. We
introduce conditions under which these distributions exist and are
normalizable, study several instances of our models, and propose scalable
penalized conditional likelihood estimators with statistical guarantees for
recovering the underlying network structure. Simulations as well as an
application to learning mixed genomic networks from next generation sequencing
expression data and mutation data demonstrate the versatility of our methods.