6,287 research outputs found

    A new insight into underlying disease mechanism through semi-parametric latent differential network model

    Background: In genomic studies, investigating how the structure of a genetic network differs between two experimental conditions is an interesting but challenging problem, especially in the high-dimensional setting. The existing literature mostly focuses on differential network modelling for continuous data; in real applications, however, we may encounter discrete or mixed data, which calls for a unified differential network model covering various data types. Results: We propose a unified latent Gaussian copula differential network model, which provides a deeper understanding of the unknown mechanism than modelling the observed variables alone. Adaptive rank-based estimation approaches are proposed under the assumption that the true differential network is sparse. These approaches do not require the individual precision matrices to be sparse, and thus allow the individual networks to contain hub nodes. Theoretical analysis shows that the proposed methods achieve the parametric convergence rate both for estimating the difference of the precision matrices and for recovering the differential structure, meaning that the extra modelling flexibility comes at almost no cost in statistical efficiency. Beyond the theoretical analysis, thorough numerical simulations compare the empirical performance of the proposed methods with other state-of-the-art methods, showing that they work well across various data types. The method is then applied to gene expression data associated with lung cancer to illustrate its empirical usefulness. Conclusions: The proposed latent-variable differential network models allow for various data types and are therefore more flexible, while providing a deeper understanding of the unknown mechanism than modelling the observed variables alone. Theoretical analysis, numerical simulations, and the real-data application all demonstrate the advantages of latent differential network modelling.
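    As a rough illustration of the rank-based idea (a minimal sketch, not the paper's adaptive estimator; the function names and the naive inversion step are ours), the snippet below estimates each condition's latent correlation matrix from Kendall's tau via the classical Gaussian copula bridge Sigma_jk = sin(pi/2 * tau_jk), then forms a crude differential network as the difference of the inverted matrices.

```python
# Minimal sketch of the rank-based latent-correlation idea, assuming the
# classical Gaussian copula bridge Sigma_jk = sin(pi/2 * tau_jk).
# The paper's adaptive sparse estimation of Delta is NOT implemented here;
# we simply invert the two latent correlation matrices as a naive baseline.
import numpy as np
from scipy.stats import kendalltau

def latent_correlation(X):
    """Rank-based (Kendall's tau) estimate of the latent correlation matrix."""
    p = X.shape[1]
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(X[:, j], X[:, k])
            R[j, k] = R[k, j] = np.sin(0.5 * np.pi * tau)
    return R

def naive_differential_network(X, Y, ridge=1e-2):
    """Delta = inv(Sigma_Y) - inv(Sigma_X), with a small ridge for stability."""
    p = X.shape[1]
    Sx = latent_correlation(X) + ridge * np.eye(p)
    Sy = latent_correlation(Y) + ridge * np.eye(p)
    return np.linalg.inv(Sy) - np.linalg.inv(Sx)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # condition 1 (could also be discretized)
Y = rng.normal(size=(200, 5))       # condition 2
Y[:, 0] += 0.8 * Y[:, 1]            # induce one differential edge
print(np.round(naive_differential_network(X, Y), 2))
```

    Because only ranks enter the estimate, the approach extends in spirit to ordinal and mixed data (with appropriate bridge adjustments), which is the flexibility the abstract emphasizes.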

    Topics in social network analysis and network science

    This chapter introduces statistical methods used in the analysis of social networks and in the rapidly evolving parallel field of network science. Although several instances of social network analysis in health services research have appeared recently, the majority involve only the most basic methods and thus only scratch the surface of what might be accomplished. Cutting-edge methods are presented using relevant examples and illustrations drawn from health services research.

    A Primer on Causality in Data Science

    Many questions in Data Science are fundamentally causal in that our objective is to learn the effect of some exposure, randomized or not, on an outcome of interest. Even studies that are seemingly non-causal, such as those with the goal of prediction or prevalence estimation, have causal elements, including differential censoring or measurement. As a result, we, as Data Scientists, need to consider the underlying causal mechanisms that gave rise to the data, rather than simply the patterns or associations observed in those data. In this work, we review the 'Causal Roadmap' of Petersen and van der Laan (2014) to provide an introduction to some key concepts in causal inference. As in other causal frameworks, the steps of the Roadmap include clearly stating the scientific question, defining the causal model, translating the scientific question into a causal parameter, assessing the assumptions needed to express the causal parameter as a statistical estimand, implementing statistical estimators (including parametric and semi-parametric methods), and interpreting the findings. We believe that using such a framework in Data Science will help ensure that our statistical analyses are guided by the scientific question driving our research, while avoiding over-interpretation of our results. We focus on the effect of an exposure occurring at a single time point and highlight the use of targeted maximum likelihood estimation (TMLE) with Super Learner.
    Comment: 26 pages (with references); 4 figures
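    To make the estimation step concrete, here is a minimal sketch of TMLE for the average treatment effect of a binary point treatment on a binary outcome, on simulated data. Plain logistic regressions stand in for the Super Learner ensemble the paper recommends, and all variable names and the data-generating process are illustrative assumptions of ours.

```python
# Minimal TMLE sketch: ATE of a binary point treatment A on a binary outcome
# Y given a covariate W. Simulated data; logistic regressions stand in for
# the Super Learner ensemble recommended in the paper. Illustrative only.
import numpy as np
import statsmodels.api as sm
from scipy.special import expit, logit

rng = np.random.default_rng(1)
n = 2000
W = rng.normal(size=n)                           # baseline covariate
A = rng.binomial(1, expit(0.5 * W))              # treatment assignment
Y = rng.binomial(1, expit(0.3 * A + 0.7 * W))    # outcome (true effect via A)

def design(a, w):
    return np.column_stack([np.ones_like(w), a, w])

clip = lambda p: np.clip(p, 1e-4, 1 - 1e-4)

# Step 1: initial outcome regression Qbar(A, W) = E[Y | A, W]
Q_fit = sm.GLM(Y, design(A, W), family=sm.families.Binomial()).fit()
Q_AW = clip(Q_fit.predict(design(A, W)))
Q1 = clip(Q_fit.predict(design(np.ones(n), W)))
Q0 = clip(Q_fit.predict(design(np.zeros(n), W)))

# Step 2: propensity score g(W) = P(A = 1 | W)
Xg = np.column_stack([np.ones(n), W])
g = np.clip(sm.GLM(A, Xg, family=sm.families.Binomial()).fit().predict(Xg),
            0.01, 0.99)

# Step 3: "targeting" step -- logistic fluctuation on the clever covariate
H = A / g - (1 - A) / (1 - g)
flux = sm.GLM(Y, H[:, None], family=sm.families.Binomial(),
              offset=logit(Q_AW)).fit()
eps = flux.params[0]

# Step 4: update the counterfactual predictions and average their contrast
Q1_star = expit(logit(Q1) + eps / g)
Q0_star = expit(logit(Q0) - eps / (1 - g))
print("TMLE ATE estimate:", (Q1_star - Q0_star).mean())
```

    In practice the two nuisance regressions in steps 1 and 2 would be fit with Super Learner, as the paper recommends, leaving the targeting and averaging steps unchanged.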

    A survey of statistical network models

    Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, along with a host of more specialized professional network communities, have intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics.
    Comment: 96 pages, 14 figures, 333 references
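    As a toy companion to the historical starting point mentioned above (assuming the 1959 reference is the Erdős–Rényi/Gilbert random graph; the sketch is ours), the snippet below simulates G(n, p), in which each of the n(n-1)/2 possible edges appears independently with probability p, and checks the mean degree against its expectation p(n-1).

```python
# Toy simulation of the Erdos-Renyi / Gilbert G(n, p) random graph
# (assumed here to be the 1959 model referenced above): every unordered
# pair of nodes is joined independently with probability p.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 0.02
upper = rng.random((n, n)) < p      # independent coin flips per ordered pair
adj = np.triu(upper, k=1)           # keep each unordered pair exactly once
adj = adj | adj.T                   # symmetrize into an undirected graph
degrees = adj.sum(axis=1)
print("mean degree:", degrees.mean(), "expected:", p * (n - 1))
```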

    Novel Methods for Multivariate Ordinal Data applied to Genetic Diplotypes, Genomic Pathways, Risk Profiles, and Pattern Similarity

    Introduction: Conventional statistical methods for multivariate data (e.g., discriminant/regression) are based on the (generalized) linear model, i.e., the data are interpreted as points in a Euclidean space of independent dimensions. The dimensionality of the data is then reduced by assuming the components to be related by a specific function of known type (linear, exponential, etc.), which allows the distance of each point from a hyperspace to be determined. While mathematically elegant, these approaches may have shortcomings when applied to real-world problems where the relative importance, the functional relationship, and the correlation among the variables tend to be unknown. Still, in many applications, each variable can be assumed to have at least an “orientation”, i.e., it can reasonably be assumed that, all other conditions held constant, an increase in this variable is either “good” or “bad”. The direction of this orientation may be known or unknown. In genetics, for instance, having more “abnormal” alleles may increase the risk (or magnitude) of a disease phenotype. In genomics, the expression of several related genes may indicate disease activity. When screening for security risks, more indicators of atypical behavior may raise more concern; in face or voice recognition, more similar indicators may increase the likelihood of a person being identified. Methods: In 1998, we developed a nonparametric method for analyzing multivariate ordinal data to assess the overall risk of HIV infection based on different types of behavior, or the overall protective effect of barrier methods against HIV infection. By using U-statistics rather than the marginal likelihood, we were able to increase the computational efficiency of this approach by several orders of magnitude. Results: We applied this approach to assessing the immunogenicity of a vaccination strategy in cancer patients. While discussing the pitfalls of the conventional methods for linking quantitative traits to haplotypes, we realized that this approach could easily be modified into a statistically valid alternative to previously proposed approaches. We have now begun to use the same methodology to correlate the activity of anti-inflammatory drugs along genomic pathways with the disease severity of psoriasis, based on several clinical and histological characteristics. Conclusion: Multivariate ordinal data are frequently collected to assess semiquantitative characteristics, such as risk profiles (genetic, genomic, or security) or similarity of patterns (faces, voices, behaviors). Conventional methods require empirical validation, because the functions and weights chosen cannot be justified on theoretical grounds. The proposed statistical method for analyzing profiles of ordinal variables is intrinsically valid. Since no additional assumptions need to be made, the often time-consuming empirical validation can be skipped.
    Keywords: ranking; nonparametric; robust; scoring; multivariate
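    A minimal sketch of the partial-ordering idea behind such U-statistic scores follows (an illustration under our own simplifications, not the authors' exact 1998 estimator): each subject's multivariate ordinal profile is compared against every other subject's, counting profiles it dominates (all components at least as high, one strictly higher) minus profiles that dominate it.

```python
# Sketch of a U-statistic-style score for multivariate ordinal profiles
# (illustrative; not the authors' exact 1998 estimator). Profile x
# "dominates" y if x >= y componentwise with at least one strict
# inequality; each subject is scored by dominated minus dominating peers.
import numpy as np

def dominates(x, y):
    return np.all(x >= y) and np.any(x > y)

def u_scores(profiles):
    n = len(profiles)
    scores = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(profiles[i], profiles[j]):
                scores[i] += 1
            elif dominates(profiles[j], profiles[i]):
                scores[i] -= 1
    return scores / (n - 1)   # normalize to [-1, 1]

# Ordinal risk profiles: higher values mean higher risk on each indicator
profiles = np.array([[2, 3, 1], [1, 1, 0], [2, 3, 2], [0, 2, 1]])
print(u_scores(profiles))     # [2, 3, 2] dominates two of its three peers
```

    Scoring by pairwise comparisons of whole profiles, rather than fitting weighted combinations of the components, is what lets the method avoid the unverifiable functional and weighting assumptions criticized above.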

    Learning with Limited Labeled Data in Biomedical Domain by Disentanglement and Semi-Supervised Learning

    In this dissertation, we are interested in improving the generalization of deep neural networks for biomedical data (e.g., electrocardiogram signals, x-ray images). Although deep neural networks have attained state-of-the-art performance, and thus deployment, across a variety of domains, similar performance in the clinical setting remains challenging due to their inability to generalize to unseen data (e.g., a new patient cohort). We address this challenge of generalization from two perspectives: 1) learning disentangled representations with the deep network, and 2) developing efficient semi-supervised learning (SSL) algorithms using the deep network. In the former, we are interested in designing specific architectures and objective functions to learn representations in which variations in the data are well separated, i.e., disentangled. In the latter, we are interested in designing regularizers that encourage the underlying neural function's behavior toward a common inductive bias, to avoid over-fitting the function to small labeled datasets. Our end goal in both approaches is to improve the generalization of the deep network underlying the diagnostic model. For disentangled representations, this translates to learning latent representations that capture the observed input's underlying explanatory factors in an independent and interpretable way. With the data's explanatory factors well separated, such a disentangled latent space can be useful for a large variety of tasks and domains within the data distribution, even with a small amount of labeled data, thus improving generalization. For efficient semi-supervised algorithms, this translates to utilizing a large volume of unlabeled data to assist learning from a limited labeled dataset, a situation commonly encountered in the biomedical domain. Drawing on ideas from several areas within deep learning, such as representation learning (e.g., autoencoders), variational inference (e.g., variational autoencoders), Bayesian nonparametrics (e.g., the beta-Bernoulli process), learning theory (e.g., analytical learning theory), and function smoothing (Lipschitz smoothness), we propose several learning algorithms to improve generalization on the associated tasks. We test our algorithms on real-world clinical data and show that our approach yields significant improvements over existing methods. Moreover, we demonstrate the efficacy of the proposed models on benchmark and simulated data to understand different aspects of the proposed learning methods. We conclude by identifying some limitations of the proposed methods, areas for further improvement, and broader future directions for the successful adoption of AI models in the clinical environment.
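    As one concrete example of an SSL regularizer of the kind described (a generic consistency-regularization sketch of ours, not one of the dissertation's specific algorithms), the loss below combines cross-entropy on a few labeled examples with a penalty that keeps the network's predictions stable under small input perturbations on unlabeled examples.

```python
# Generic consistency-regularization SSL loss (a sketch of the idea, not one
# of the dissertation's algorithms): supervised cross-entropy on the small
# labeled batch plus a KL stability penalty on unlabeled inputs under small
# Gaussian perturbations, nudging the network toward smooth behavior.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def ssl_loss(x_lab, y_lab, x_unlab, noise=0.1, weight=1.0):
    sup = F.cross_entropy(net(x_lab), y_lab)
    with torch.no_grad():                       # targets: clean predictions
        p_clean = F.softmax(net(x_unlab), dim=1)
    x_pert = x_unlab + noise * torch.randn_like(x_unlab)
    p_noisy = F.log_softmax(net(x_pert), dim=1)
    consistency = F.kl_div(p_noisy, p_clean, reduction="batchmean")
    return sup + weight * consistency

# Tiny synthetic batch: 8 labeled and 64 unlabeled examples of 16 features
x_lab, y_lab = torch.randn(8, 16), torch.randint(0, 2, (8,))
x_unlab = torch.randn(64, 16)
for _ in range(100):
    opt.zero_grad()
    loss = ssl_loss(x_lab, y_lab, x_unlab)
    loss.backward()
    opt.step()
print("final loss:", float(loss))
```

    The dissertation's actual regularizers draw on the more specific tools listed above (variational autoencoders, the beta-Bernoulli process, analytical learning theory, Lipschitz smoothness); this sketch only conveys the shared structure of a supervised term plus an unlabeled regularization term.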

    Contributions to Mediation Analysis and First Principles Modeling for Mechanistic Statistical Analysis

    This thesis contains three projects that propose novel methods for studying mechanisms that explain statistical relationships. The ultimate goal of each of these methods is to help researchers describe how or why complex relationships between observed variables exist. The first project proposes and studies a method for recovering mediation structure in high dimensions. We take a dimension-reduction approach that generalizes the "product of coefficients" concept from univariate mediation analysis through the optimization of a loss function, and we devise an efficient algorithm for optimizing this product-of-coefficients-inspired loss. Through extensive simulation studies, we show that the method is capable of consistently identifying mediation structure. Finally, two case studies demonstrate how the method can be used to conduct multivariate mediation analysis. The second project uses tools from conditional inference to improve the calibration of tests of univariate mediation hypotheses. The key insight of the project is that the non-Euclidean geometry of the null parameter space causes the test statistic's sampling distribution to depend on a nuisance parameter. After identifying a statistic that is both sufficient for the nuisance parameter and approximately ancillary for the parameter of interest, we derive the test statistic's limiting conditional sampling distribution. We additionally develop a non-standard bootstrap procedure for calibration in finite samples. We demonstrate through simulation studies that improved evidence calibration leads to substantial power increases over existing methods, suggesting that conditional inference may be a useful tool for evidence calibration in other non-standard or otherwise challenging problems. In the last project, we present a methodological contribution to a pharmaceutical science study of in vivo ibuprofen pharmacokinetics. We demonstrate how model misspecification in a first-principles analysis can be addressed by augmenting the model with a term corresponding to an omitted source of variation. In previously used first-principles models, gastric emptying, which is pulsatile and stochastic, is modeled as first-order diffusion for simplicity. However, analyses suggest that the actual gastric emptying process is a unimodal smooth function, with phase and amplitude varying by subject. We therefore adopt a flexible approach in which a highly idealized parametric form of gastric emptying is combined with a Gaussian process capturing deviations from the idealized form. These functions are characterized by their distributions, which allows us to learn their common and unique features across subjects even though the features are not directly observed. Through simulation studies, we show that the proposed approach is able to identify certain features of latent function distributions.
    PhD thesis, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163026/1/josephdi_1.pd
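    For readers unfamiliar with the univariate "product of coefficients" mentioned above, the sketch below computes the classical mediated effect a*b from the two regressions M ~ X and Y ~ X + M, with a simple percentile bootstrap interval. This is the textbook construction the first project generalizes, not the thesis's high-dimensional method, and the simulated data are ours.

```python
# Textbook "product of coefficients" mediation sketch (the univariate concept
# the thesis's first project generalizes; NOT its high-dimensional method).
# a: effect of exposure X on mediator M; b: effect of M on outcome Y given X.
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=n)
M = 0.6 * X + rng.normal(size=n)               # true a = 0.6
Y = 0.5 * M + 0.2 * X + rng.normal(size=n)     # true b = 0.5, so a*b = 0.3

def mediated_effect(X, M, Y):
    a = np.polyfit(X, M, 1)[0]                 # slope of M ~ X
    design = np.column_stack([np.ones_like(X), X, M])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    b = coef[2]                                # slope of Y ~ M, adjusting for X
    return a * b

boots = [mediated_effect(X[i], M[i], Y[i])
         for i in (rng.integers(0, n, n) for _ in range(500))]
print("a*b estimate:", round(mediated_effect(X, M, Y), 3),
      "95% CI:", np.round(np.percentile(boots, [2.5, 97.5]), 3))
```

    The thesis's first project replaces this scalar a*b with a loss-function formulation suited to high-dimensional mediators, and its second project addresses the calibration problems that arise when testing whether such a product is zero.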