26,954 research outputs found

    An algorithm for removing sensitive information: application to race-independent recidivism prediction

    Predictive modeling is increasingly being employed to assist human decision-makers. One purported advantage of replacing or augmenting human judgment with computer models in high-stakes settings -- such as sentencing, hiring, policing, college admissions, and parole decisions -- is the perceived "neutrality" of computers. It is argued that because computer models hold no personal prejudice, the predictions they produce will be equally free from prejudice. There is growing recognition, however, that employing algorithms does not remove the potential for bias, and can even amplify it if the training data were generated by a process that is itself biased. In this paper, we provide a probabilistic notion of algorithmic bias. We propose a method to eliminate bias from predictive models by removing all information regarding the protected variables from the data on which the models will ultimately be trained. Unlike previous work in this area, our framework is general enough to accommodate data on any measurement scale. Motivated by models currently in use in the criminal justice system that inform decisions on pre-trial release and parole, we apply our proposed method to a dataset on the criminal histories of individuals at the time of sentencing to produce "race-neutral" predictions of re-arrest. In the process, we demonstrate that a common approach to creating "race-neutral" models -- omitting race as a covariate -- still results in racially disparate predictions. We then demonstrate that applying our proposed method to these data removes racial disparities from predictions with minimal impact on predictive accuracy.
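    The point that omitting a protected variable does not neutralize its influence can be seen in a toy simulation (a hypothetical illustration, not the paper's debiasing method): when another covariate is a proxy for the protected variable, a model fit without the protected variable still yields group-disparate predictions.

```python
import numpy as np

# Toy illustration: omitting a protected variable does not make predictions
# neutral when a remaining covariate is a proxy for it. All names here are
# hypothetical; this is not the algorithm proposed in the paper.
rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)                    # hypothetical protected variable
proxy = group + 0.5 * rng.standard_normal(n)     # covariate correlated with group
y = 2.0 * proxy + rng.standard_normal(n)         # outcome driven by the proxy

# "Neutral" model: least squares on the proxy alone, with group omitted
X = np.column_stack([np.ones(n), proxy])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta

# Predictions still differ systematically between groups
gap = pred[group == 1].mean() - pred[group == 0].mean()
print(f"mean prediction gap between groups: {gap:.2f}")  # clearly nonzero
```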

    Copula Modeling of Multivariate Longitudinal Data with Dropout

    Joint multivariate longitudinal and time-to-event data are gaining increasing attention in the biomedical sciences, where subjects are followed over time to monitor the progress of a disease or medical condition. In the insurance context, claims outcomes may be related to a policyholder's dropout, or decision to lapse a policy. This paper introduces a generalized method of moments technique to estimate dependence parameters, where associations are represented using copulas. A simulation study demonstrates the viability of the approach. Using illustrative data from a Spanish insurer, the paper describes how the joint model provides new information that insurers can use to better manage their portfolios of risks.

    Penalized linear regression with high-dimensional pairwise screening

    In variable selection, most existing screening methods focus on marginal effects and ignore dependence between covariates. To improve the performance of selection, we incorporate pairwise effects among covariates for screening and penalization. We achieve this by studying the asymptotic distribution of the maximal absolute pairwise sample correlation among independent covariates. The novelty of the theory is that the convergence is with respect to the dimensionality p, and is uniform with respect to the sample size n. Moreover, we obtain an upper bound for the maximal pairwise R-squared when regressing the response onto two different covariates. Based on these extreme value results, we propose a screening procedure to detect covariate pairs that are potentially correlated and associated with the response. We further combine the pairwise screening with Sure Independence Screening and develop a new regularized variable selection procedure. Numerical studies show that our method is very competitive in terms of both prediction accuracy and variable selection accuracy.
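    As a rough sketch of the screening idea (with a fixed illustrative threshold in place of the extreme-value-calibrated cutoff derived in the paper), one can flag covariate pairs whose absolute sample correlation is unusually large:

```python
import numpy as np

def screen_correlated_pairs(X, threshold=0.5):
    """Flag covariate pairs whose absolute sample correlation exceeds a
    threshold. The fixed threshold is a stand-in for the cutoff the paper
    derives from the asymptotic distribution of the maximal correlation."""
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)  # p x p sample correlation matrix
    return [(j, k) for j in range(p) for k in range(j + 1, p)
            if abs(R[j, k]) > threshold]

# Demo: five covariates, columns 0 and 1 made strongly correlated
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)
pairs = screen_correlated_pairs(X)
print(pairs)  # the (0, 1) pair should be flagged
```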

    A review of Gaussian Markov models for conditional independence

    Markov models lie at the interface between statistical independence in a probability distribution and graph separation properties. We review model selection and estimation in directed and undirected Markov models with Gaussian parametrization, emphasizing the main similarities and differences. These two model classes are similar but not equivalent, although they share a common intersection. We present the existing results from a historical perspective, taking into account the literature from both the artificial intelligence and statistics research communities, where these models originated. We cover classical topics such as maximum likelihood estimation and model selection via hypothesis testing, but also more modern approaches such as regularization and Bayesian methods. We also discuss how the Markov models reviewed fit into the rich hierarchy of other, higher-level Markov model classes. Finally, we close the paper by reviewing relaxations of the Gaussian assumption and pointing out the main areas of application where these Markov models are used today.

    High-dimensional structure learning of binary pairwise Markov networks: A comparative numerical study

    Learning the undirected graph structure of a Markov network from data is a problem that has received a lot of attention during the last few decades. As a result of the general applicability of the model class, a myriad of methods have been developed in parallel in several research fields. Recently, as the size of the considered systems has increased, the focus of new methods has shifted towards the high-dimensional domain. In particular, the introduction of the pseudo-likelihood function has pushed the limits of score-based methods, which were originally based on the likelihood function. At the same time, methods based on simple pairwise tests have been developed to meet the challenges arising from increasingly large data sets in computational biology. Apart from being applicable to high-dimensional problems, methods based on the pseudo-likelihood and on pairwise tests are fundamentally very different. To compare the accuracy of the different types of methods, an extensive numerical study is performed on data generated by binary pairwise Markov networks. A parallelizable Gibbs sampler, based on restricted Boltzmann machines, is proposed as a tool to efficiently sample from sparse high-dimensional networks. The results of the study show that pairwise methods can be more accurate than pseudo-likelihood methods in settings often encountered in high-dimensional structure learning applications.
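    For intuition, sampling from a binary pairwise Markov network can be done with a plain single-site Gibbs sampler; the sketch below is the textbook sequential scheme, not the parallel RBM-based sampler the paper proposes, and the function names are illustrative.

```python
import numpy as np

def gibbs_sample(J, h, n_sweeps=1000, rng=None):
    """Single-site Gibbs sampler for a binary (+/-1) pairwise Markov network
    with symmetric, zero-diagonal coupling matrix J and field vector h.
    Each sweep resamples every variable from its full conditional."""
    rng = np.random.default_rng(rng)
    d = len(h)
    x = rng.choice([-1, 1], size=d)
    for _ in range(n_sweeps):
        for i in range(d):
            # conditional log-odds of x_i = +1 given the other variables
            field = h[i] + J[i] @ x - J[i, i] * x[i]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            x[i] = 1 if rng.random() < p_plus else -1
    return x

# Demo: two positively coupled spins
J = np.array([[0.0, 1.5], [1.5, 0.0]])
h = np.zeros(2)
x = gibbs_sample(J, h, n_sweeps=100, rng=0)
print(x)
```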

    Structure Learning in Graphical Modeling

    A graphical model is a statistical model associated with a graph whose nodes correspond to variables of interest. The edges of the graph reflect allowed conditional dependencies among the variables. Graphical models admit computationally convenient factorization properties and have long been a valuable tool for tractable modeling of multivariate distributions. More recently, applications such as reconstructing gene regulatory networks from gene expression data have driven major advances in structure learning, that is, estimating the graph underlying a model. We review some of these advances and discuss methods such as the graphical lasso and neighborhood selection for undirected graphical models (or Markov random fields), and the PC algorithm and score-based search methods for directed graphical models (or Bayesian networks). We further review extensions that account for the effects of latent variables and heterogeneous data sources.
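    The neighborhood selection idea mentioned above can be sketched in a few lines: lasso-regress each variable on all the others and connect two nodes whenever either regression retains the other variable (the "OR" rule). This is a minimal illustration with an arbitrary fixed penalty, not a tuned implementation.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Plain coordinate-descent lasso for (1/2n)||y - Xw||^2 + alpha*||w||_1,
    assuming roughly standardized columns."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]  # partial residual excluding j
            rho = X[:, j] @ r / n
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return w

def neighborhood_selection(X, alpha=0.1):
    """Neighborhood selection in the Meinshausen-Buhlmann style: lasso-regress
    each variable on the rest; j -- k is an edge if either regression keeps
    the other variable (the "OR" rule)."""
    p = X.shape[1]
    edges = set()
    for j in range(p):
        others = [k for k in range(p) if k != j]
        w = lasso_cd(X[:, others], X[:, j], alpha)
        for coef, k in zip(w, others):
            if abs(coef) > 1e-8:
                edges.add(tuple(sorted((j, k))))
    return sorted(edges)

# Demo: a three-variable chain x0 -> x1 -> x2; the true graph is 0 -- 1 -- 2
rng = np.random.default_rng(1)
n = 500
x0 = rng.standard_normal(n)
x1 = 0.9 * x0 + 0.5 * rng.standard_normal(n)
x2 = 0.9 * x1 + 0.5 * rng.standard_normal(n)
edges = neighborhood_selection(np.column_stack([x0, x1, x2]), alpha=0.1)
print(edges)  # edges (0, 1) and (1, 2) should be recovered
```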

    Kernel-based Tests for Joint Independence

    We investigate the problem of testing whether d random variables, which may or may not be continuous, are jointly (or mutually) independent. Our method builds on ideas of the two-variable Hilbert-Schmidt independence criterion (HSIC) but allows for an arbitrary number of variables. We embed the d-dimensional joint distribution and the product of the marginals into a reproducing kernel Hilbert space and define the d-variable Hilbert-Schmidt independence criterion (dHSIC) as the squared distance between the embeddings. In the population case, the value of dHSIC is zero if and only if the d variables are jointly independent, as long as the kernel is characteristic. Based on an empirical estimate of dHSIC, we define three different non-parametric hypothesis tests: a permutation test, a bootstrap test, and a test based on a Gamma approximation. We prove that the permutation test achieves the significance level and that the bootstrap test achieves pointwise asymptotic significance level as well as pointwise asymptotic consistency (i.e., it is able to detect any type of fixed dependence in the large sample limit). The Gamma approximation does not come with these guarantees; however, it is computationally very fast and, for small d, performs well in practice. Finally, we apply the test to a problem in causal discovery.
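    A minimal sketch of the permutation-test variant might look as follows, using Gaussian kernels with a median-heuristic bandwidth and the biased empirical dHSIC; the function names are illustrative, and a production implementation would follow the paper's exact estimator and calibration.

```python
import numpy as np

def gaussian_gram(x, bandwidth=None):
    """Gaussian-kernel Gram matrix; bandwidth set by the median heuristic."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    if bandwidth is None:
        positive = sq[sq > 0]
        bandwidth = np.sqrt(np.median(positive) / 2) if positive.size else 1.0
    return np.exp(-sq / (2 * bandwidth ** 2))

def dhsic_stat(grams):
    """Biased empirical dHSIC computed from per-variable Gram matrices."""
    n = grams[0].shape[0]
    prod = np.ones((n, n))   # entrywise product of all Gram matrices
    row = np.ones(n)         # product of per-variable row means
    mean_term = 1.0          # product of per-variable grand means
    for K in grams:
        prod *= K
        row *= K.mean(axis=1)
        mean_term *= K.mean()
    return prod.mean() + mean_term - 2 * row.mean()

def dhsic_perm_test(samples, n_perm=200, rng=None):
    """Permutation test for joint independence: permuting each variable's
    observations independently emulates the null of joint independence."""
    rng = np.random.default_rng(rng)
    grams = [gaussian_gram(x) for x in samples]
    n = grams[0].shape[0]
    stat = dhsic_stat(grams)
    exceed = 0
    for _ in range(n_perm):
        permuted = []
        for K in grams:
            p = rng.permutation(n)
            permuted.append(K[np.ix_(p, p)])
        exceed += dhsic_stat(permuted) >= stat
    return stat, (1 + exceed) / (1 + n_perm)

# Demo: x and y are strongly dependent, z is independent noise
rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
y = x + 0.1 * rng.standard_normal(n)
z = rng.standard_normal(n)
stat, pval = dhsic_perm_test([x, y, z], n_perm=100, rng=1)
print(stat, pval)  # the dependence should yield a small p-value
```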

    Spatial random field models based on L\'evy indicator convolutions

    Process convolutions yield random fields with flexible marginal distributions and dependence beyond Gaussianity, but statistical inference is often hampered by a lack of closed-form marginal distributions, and simulation-based inference may be prohibitively computationally intensive. Here we remedy such issues through a class of process convolutions based on smoothing a (d+1)-dimensional L\'evy basis with an indicator function kernel to construct a d-dimensional convolution process. Indicator kernels ensure univariate distributions in the L\'evy basis family, which provides a sound basis for interpretation, parametric modeling, and statistical estimation. We propose a class of isotropic stationary convolution processes constructed through hypograph indicator sets, defined as the space between the curve (s, H(s)) of a spherical probability density function H and the plane (s, 0). If H is radially decreasing, the covariance is expressed through the univariate distribution function of H. The bivariate joint tail behavior of such convolution processes is studied in detail. Simulation and modeling extensions beyond isotropic stationary spatial models are discussed, including latent process models. For statistical inference of parametric models, we develop pairwise likelihood techniques and illustrate these on spatially indexed weed counts in the Bjertop data set, and on daily wind speed maxima observed over 30 stations in the Netherlands.

    A parallel algorithm for penalized learning of the multivariate exponential family from data of mixed types

    Computationally efficient evaluation of penalized estimators of multivariate exponential family distributions is sought. These distributions encompass, among others, Markov random fields with variates of mixed type (e.g., binary and continuous) as a special case of interest. The model parameter is estimated by maximization of the pseudo-likelihood augmented with a convex penalty. The estimator is shown to be consistent. With a world of multi-core computers in mind, a computationally efficient parallel Newton-Raphson algorithm is presented for numerical evaluation of the estimator, alongside conditions for its convergence. Parallelization comprises the division of the parameter vector into subvectors that are estimated simultaneously and subsequently aggregated to form an estimate of the original parameter. This approach may also enable efficient numerical evaluation of other high-dimensional estimators. The performance of the proposed estimator and algorithm is evaluated in a simulation study, and the paper concludes with an illustration of the presented methodology in the reconstruction of a conditional independence network from the data of an integrative omics study.

    A General Framework for Mixed Graphical Models

    "Mixed data" comprising a large number of heterogeneous variables (e.g., count, binary, continuous, and skewed continuous, among other data types) are prevalent in varied areas such as genomics and proteomics, imaging genetics, national security, social networking, and Internet advertising. There have been limited efforts at statistically modeling such mixed data jointly, in part because of the lack of computationally amenable multivariate distributions that can capture direct dependencies between mixed variables of different types. In this paper, we address this by introducing a novel class of Block Directed Markov Random Fields (BDMRFs). Using the basic building block of node-conditional univariate exponential families from Yang et al. (2012), we introduce a class of mixed conditional random field distributions that are then chained according to a block-directed acyclic graph to form BDMRFs. The Markov independence graph structure underlying a BDMRF thus has both directed and undirected edges. We introduce conditions under which these distributions exist and are normalizable, study several instances of our models, and propose scalable penalized conditional likelihood estimators with statistical guarantees for recovering the underlying network structure. Simulations, as well as an application to learning mixed genomic networks from next-generation sequencing expression data and mutation data, demonstrate the versatility of our methods.