
    Penalized Estimation of Directed Acyclic Graphs From Discrete Data

    Bayesian networks, with structure given by a directed acyclic graph (DAG), are a popular class of graphical models. However, learning Bayesian networks from discrete or categorical data is particularly challenging, due to the large parameter space and the difficulty in searching for a sparse structure. In this article, we develop a maximum penalized likelihood method to tackle this problem. Instead of the commonly used multinomial distribution, we model the conditional distribution of a node given its parents by multi-logit regression, in which an edge is parameterized by a set of coefficient vectors with dummy variables encoding the levels of a node. To obtain a sparse DAG, a group norm penalty is employed, and a blockwise coordinate descent algorithm is developed to maximize the penalized likelihood subject to the acyclicity constraint of a DAG. When interventional data are available, our method constructs a causal network, in which a directed edge represents a causal relation. We apply our method to various simulated and real data sets. The results show that our method is very competitive, compared to many existing methods, in DAG estimation from both interventional and high-dimensional observational data. Comment: To appear in Statistics and Computing.
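The core per-node building block described above (multi-logit regression of a child node on dummy-coded candidate parents with a group-norm penalty, so that an all-zero coefficient block corresponds to "no edge") can be sketched as follows. This is only an illustrative proximal-gradient sketch with made-up names; it omits the acyclicity constraint and the blockwise coordinate descent over nodes that the paper's algorithm adds on top.

```python
# Hypothetical sketch of the per-node subproblem: group-lasso-penalized
# multi-logit regression solved by proximal gradient descent.
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def group_penalized_multilogit(X_groups, y, n_levels, lam=0.1, lr=0.05, n_iter=500):
    """X_groups: list of dummy-coded design blocks, one per candidate parent.
    y: integer levels of the child node (0..n_levels-1).
    Returns one coefficient block per parent; an all-zero block means 'no edge'."""
    X = np.hstack(X_groups)                      # full design matrix
    n = X.shape[0]
    sizes = [G.shape[1] for G in X_groups]
    offsets = np.cumsum([0] + sizes)
    B = np.zeros((X.shape[1], n_levels))         # coefficients, grouped by parent
    Y = np.eye(n_levels)[y]                      # one-hot targets
    for _ in range(n_iter):
        grad = X.T @ (softmax(X @ B) - Y) / n    # gradient of the negative log-likelihood
        B -= lr * grad
        for j in range(len(X_groups)):           # proximal step: groupwise soft-thresholding
            blk = slice(offsets[j], offsets[j + 1])
            norm = np.linalg.norm(B[blk])
            B[blk] = 0.0 if norm <= lr * lam else (1 - lr * lam / norm) * B[blk]
    return [B[offsets[j]:offsets[j + 1]] for j in range(len(X_groups))]
```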

    Health and Work of the Elderly: Subjective Health Measures, Reporting Errors and the Endogenous Relationship Between Health and Work

    In empirical studies of retirement decisions of the elderly, health is often found to have a large, if not dominant, effect. Depending on which health measures are used, these estimated effects may be biased estimates of the causal effect of health on the dependent variable(s). Research indicates that subjective, self-assessed health measures may be affected by endogenous reporting behaviour, and even if an objective health measure is used, it is not likely to be strictly exogenous to labour market status or labour income. Health and labour market variables will be correlated because of unobserved individual-specific characteristics (e.g., investments in human capital and health capital). Moreover, one's labour market status may be expected to have a (reverse) causal effect on current and future health. In this paper we analyse the relative importance of these endogeneity and measurement issues in the context of a model of early retirement decisions. We state assumptions under which we can use relatively simple methods to assess the relative importance of state-dependent reporting errors in individual responses to health questions. The estimation results indicate that among respondents receiving a disability insurance allowance, reporting errors are large and systematic, and that using these measures in retirement models may therefore seriously bias the parameter estimates and the conclusions drawn from them. We furthermore find that health deteriorates with work and that the two variables are endogenously related.
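As a purely illustrative sketch of the kind of "relatively simple method" gestured at above, one could, under the strong assumption that an objective health index is free of reporting error, regress self-assessed health on that index plus labour-market-state indicators; a sizeable state coefficient would then point to state-dependent reporting. This is not the paper's estimator, and all file and column names below are hypothetical.

```python
# Illustrative reporting-error check (hypothetical data and column names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("elderly_panel.csv")   # hypothetical panel data set
model = smf.ols(
    "self_assessed_health ~ objective_health_index + receives_disability_benefit + age",
    data=df,
).fit(cov_type="HC1")                   # heteroskedasticity-robust standard errors
print(model.summary())                  # inspect the disability-benefit coefficient
```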

    Collaborative Targeted Maximum Likelihood Estimation

    Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators (TMLE) of causal inference and variable importance parameters. The targeted maximum likelihood approach involves fluctuating an initial density estimate, Q, in order to make a bias/variance tradeoff targeted towards a specific parameter in a semi-parametric model. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE and other double robust estimators have been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified. In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from the current state of the art in nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the relevant factor Q_0, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best among all candidate TMLEs in this sequence. A penalized-likelihood loss function for Q_0 is suggested when the parameter of interest is borderline-identifiable. We present theoretical results for collaborative double robustness, demonstrating that the collaborative targeted maximum likelihood estimator is CAN when Q and g are both mis-specified, provided that g solves a specified score equation implied by the difference between Q and the true Q_0. This marks an improvement over the current definition of double robustness in the estimating equation literature. We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth and, as a consequence, can even be super efficient if the first-stage density estimator does an excellent job itself with respect to the target parameter. This research provides a template for targeted, efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite-dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of the likelihood to fine-tune the fitting of the nuisance parameter (censoring mechanism/treatment mechanism).
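The following is a deliberately simplified sketch in the spirit of C-TMLE for the average treatment effect with binary treatment A, binary outcome Y and covariates W: an initial outcome fit Q is fluctuated once for each member of a sequence of increasingly flexible propensity (g) estimators, and the candidate with the smallest log-likelihood loss for Q is kept. In-sample loss stands in for the paper's likelihood-based cross-validation, and all estimator choices are assumptions, not the authors' implementation.

```python
# Simplified C-TMLE-style sketch for the ATE (illustrative, not the paper's code).
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

EPS = 1e-6
logit = lambda p: np.log(np.clip(p, EPS, 1 - EPS) / np.clip(1 - p, EPS, 1 - EPS))
expit = lambda x: 1 / (1 + np.exp(-x))

def tmle_candidate(QA, Q1, Q0, A, Y, g):
    """One targeting step: fluctuate Q along the 'clever covariate' built from g."""
    g = np.clip(g, EPS, 1 - EPS)
    H = A / g - (1 - A) / (1 - g)
    def loss(e):
        p = np.clip(expit(logit(QA) + e * H), EPS, 1 - EPS)
        return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))
    e = minimize_scalar(loss, bounds=(-10, 10), method="bounded").x
    Q1_star = expit(logit(Q1) + e / g)           # updated prediction under A = 1
    Q0_star = expit(logit(Q0) - e / (1 - g))     # updated prediction under A = 0
    return np.mean(Q1_star - Q0_star), loss(e)

def ctmle_ate(W, A, Y):
    Qfit = LogisticRegression(max_iter=1000).fit(np.c_[A, W], Y)   # initial Q
    QA = Qfit.predict_proba(np.c_[A, W])[:, 1]
    Q1 = Qfit.predict_proba(np.c_[np.ones_like(A), W])[:, 1]
    Q0 = Qfit.predict_proba(np.c_[np.zeros_like(A), W])[:, 1]
    g_models = [LogisticRegression(max_iter=1000),                 # g sequence:
                GradientBoostingClassifier(max_depth=1),           # simple -> flexible
                GradientBoostingClassifier(max_depth=3)]
    candidates = []
    for gm in g_models:
        g = gm.fit(W, A).predict_proba(W)[:, 1]
        candidates.append(tmle_candidate(QA, Q1, Q0, A, Y, g))
    return min(candidates, key=lambda c: c[1])[0]  # select by loss for Q, not for g
```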

    Learning stable and predictive structures in kinetic systems: Benefits of a causal approach

    Learning kinetic systems from data is one of the core challenges in many fields. Identifying stable models is essential for the generalization capabilities of data-driven inference. We introduce a computationally efficient framework, called CausalKinetiX, that identifies structure from discrete-time, noisy observations generated from heterogeneous experiments. The algorithm assumes the existence of an underlying, invariant kinetic model, a key criterion for reproducible research. Results on both simulated and real-world examples suggest that learning the structure of kinetic systems benefits from a causal perspective. The identified variables and models allow for a concise description of the dynamics across multiple experimental settings and can be used for prediction in unseen experiments. We observe significant improvements compared to well-established approaches focusing solely on predictive performance, especially for out-of-sample generalization.
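A rough illustration of the invariance idea behind this approach (not the authors' algorithm): for each candidate set of predictor variables, fit one pooled linear model for a numerically estimated derivative of the target trajectory and score the set by its worst-case residual error across experiments, so that candidates whose fit transfers across experiments score well. Data shapes and names are hypothetical.

```python
# Illustrative stability score across heterogeneous experiments.
import numpy as np
from itertools import combinations

def stability_score(candidate, t, trajectories, target):
    """trajectories: list over experiments of arrays (n_times, n_vars)."""
    X_all, dy_all, env = [], [], []
    for e, Z in enumerate(trajectories):
        dy = np.gradient(Z[:, target], t)          # crude derivative estimate
        X_all.append(Z[:, list(candidate)])
        dy_all.append(dy)
        env.append(np.full(len(t), e))
    X, dy, env = np.vstack(X_all), np.concatenate(dy_all), np.concatenate(env)
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)  # one pooled (invariant) fit
    resid = dy - X @ beta
    return max(np.mean(resid[env == e] ** 2) for e in range(len(trajectories)))

def rank_candidates(t, trajectories, target, max_size=2):
    n_vars = trajectories[0].shape[1]
    others = [v for v in range(n_vars) if v != target]
    cands = [c for k in range(1, max_size + 1) for c in combinations(others, k)]
    return sorted(cands, key=lambda c: stability_score(c, t, trajectories, target))
```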

    Optimization for Explainable Modeling

    Whether it is the signaling mechanisms behind immune cells or the change in animal populations, mechanistic models such as agent-based models or systems of differential equations that define explicit causal mechanisms are used to validate hypotheses and thereby understand physical systems. To validate a mechanistic model quantitatively and qualitatively, experimental data are used to fit and estimate the parameters within these models, thereby providing interpretable and explainable quantitative values. Parameter estimation for mechanistic models can be extremely challenging for a variety of reasons, especially for single-cell systems. First, measurements of protein abundances can vary by many orders of magnitude, and the number of model parameters often exceeds that of the data. Second, mechanistic simulations can be computationally expensive, so that parameter estimation can take hours to days, and even longer when fine-tuning an optimization algorithm. By building the framework BioNetGMMFit, we show that we can readily account for the large variances within single-cell models using the generalized method of moments, and by leveraging deep learning for surrogate modeling, we show that we can reduce the computational cost of parameter estimation. Academic Major: Computer Science and Engineering.
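A minimal generalized-method-of-moments sketch of the kind of fitting described above (not BioNetGMMFit itself): match the means, variances and covariances of simulated single-cell snapshots to the observed ones and minimize the weighted discrepancy. The simulator simulate_cells(theta, n_cells) is a hypothetical stand-in for a mechanistic single-cell model.

```python
# Illustrative GMM fit of a (hypothetical) single-cell simulator to snapshot data.
import numpy as np
from scipy.optimize import minimize

def moment_vector(samples):
    """Means plus upper-triangular covariance entries of a (cells x proteins) array."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    iu = np.triu_indices(samples.shape[1])
    return np.concatenate([mu, cov[iu]])

def gmm_objective(theta, observed_moments, simulate_cells, weight, n_cells=1000):
    diff = moment_vector(simulate_cells(theta, n_cells)) - observed_moments
    return diff @ weight @ diff                    # weighted quadratic form

def fit_gmm(theta0, observed, simulate_cells):
    obs_m = moment_vector(observed)
    W = np.eye(len(obs_m))                         # identity weighting for this sketch
    res = minimize(gmm_objective, theta0,
                   args=(obs_m, simulate_cells, W),
                   method="Nelder-Mead")           # derivative-free: noisy objective
    return res.x
```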

    Learning Large-Scale Bayesian Networks with the sparsebn Package

    Learning graphical models from data is an important problem with wide applications, ranging from genomics to the social sciences. Nowadays datasets often have upwards of thousands (sometimes tens or hundreds of thousands) of variables and far fewer samples. To meet this challenge, we have developed a new R package called sparsebn for learning the structure of large, sparse graphical models with a focus on Bayesian networks. While there are many existing software packages for this task, this package focuses on the unique setting of learning large networks from high-dimensional data, possibly with interventions. As such, the methods provided place a premium on scalability and consistency in a high-dimensional setting. Furthermore, in the presence of interventions, the methods implemented here achieve the goal of learning a causal network from data. Additionally, the sparsebn package is fully compatible with existing software packages for network analysis. Comment: To appear in the Journal of Statistical Software, 39 pages, 7 figures.
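For intuition only (sparsebn itself is an R package, and this is not its API): the standard way interventional samples enter node-wise scores in this literature is that rows in which a node was experimentally fixed carry no information about that node's own conditional distribution, so they are excluded when scoring that node. A hedged Python sketch with illustrative names:

```python
# Illustrative handling of interventional rows in a node-wise likelihood score.
import numpy as np

def nodewise_loglik(node, parents, data, intervened, loglik_fn):
    """data: (n_samples, n_vars); intervened: boolean mask (n_samples, n_vars),
    True where a variable was set by intervention; loglik_fn scores one node
    given its parents on a subset of rows."""
    keep = ~intervened[:, node]            # drop rows where `node` was intervened on
    return loglik_fn(data[keep, node], data[np.ix_(keep, parents)])
```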