1,452 research outputs found

    Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems

    Approximate Bayesian computation methods can be used to evaluate posterior distributions without having to calculate likelihoods. In this paper we discuss and apply an approximate Bayesian computation (ABC) method based on sequential Monte Carlo (SMC) to estimate parameters of dynamical models. We show that ABC SMC gives information about the inferability of parameters and model sensitivity to changes in parameters, and tends to perform better than other ABC approaches. The algorithm is applied to several well-known biological systems, for which parameters and their credible intervals are inferred. Moreover, we develop ABC SMC as a tool for model selection; given a range of different mathematical descriptions, ABC SMC is able to choose the best model using the standard Bayesian model selection apparatus. Comment: 26 pages, 9 figures
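    The core likelihood-free idea underlying ABC SMC can be illustrated with its simplest ancestor, ABC rejection: sample parameters from the prior, simulate data, and keep parameters whose simulations fall within a tolerance of the observations. This is a minimal sketch on a toy Gaussian-mean problem, not the paper's algorithm; ABC SMC refines it by shrinking the tolerance over a sequence of weighted populations.

    ```python
    import random

    def abc_rejection(observed, simulate, prior_sample, distance, eps, n_accept):
        """Basic ABC rejection: accept theta when simulated data lies within
        tolerance eps of the observed summary statistic. ABC SMC improves on
        this by propagating weighted particles through decreasing tolerances."""
        accepted = []
        while len(accepted) < n_accept:
            theta = prior_sample()                      # draw from the prior
            if distance(simulate(theta), observed) <= eps:
                accepted.append(theta)                  # keep close matches
        return accepted

    # Toy example: infer the mean of a Gaussian with known sd = 1,
    # using the sample mean as the summary statistic.
    random.seed(0)
    true_mean = 2.0
    obs = sum(random.gauss(true_mean, 1.0) for _ in range(100)) / 100

    post = abc_rejection(
        observed=obs,
        simulate=lambda th: sum(random.gauss(th, 1.0) for _ in range(100)) / 100,
        prior_sample=lambda: random.uniform(-5, 5),
        distance=lambda a, b: abs(a - b),
        eps=0.1,
        n_accept=200,
    )
    est = sum(post) / len(post)   # approximate posterior mean
    ```

    The accepted draws approximate the posterior; as eps shrinks the approximation tightens, at the cost of a lower acceptance rate, which is the inefficiency ABC SMC is designed to mitigate.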

    Mixed membership stochastic blockmodels

    Observations consisting of measurements on relationships for pairs of objects arise in many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilistic models can be delicate because the simple exchangeability assumptions underlying many boilerplate models no longer hold. In this paper, we describe a latent variable model of such data called the mixed membership stochastic blockmodel. This model extends blockmodels for relational data to ones which capture mixed membership latent relational structure, thus providing an object-specific low-dimensional representation. We develop a general variational inference algorithm for fast approximate posterior inference. We explore applications to social and protein interaction networks. Comment: 46 pages, 14 figures, 3 tables
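    The generative process the abstract describes can be sketched directly: each node draws a membership vector over latent roles, and each directed pair draws sender and receiver roles whose interaction probability governs the edge. The block-interaction matrix and sizes below are illustrative choices, not values from the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_mmsb(n, alpha, B):
        """Generative sketch of the mixed membership stochastic blockmodel:
        each node i draws memberships pi_i ~ Dirichlet(alpha); for each
        directed pair (i, j), sender and receiver each draw a role from
        their own memberships, and the edge is Bernoulli with the
        role-pair probability B[z_send, z_recv]."""
        K = len(alpha)
        pi = rng.dirichlet(alpha, size=n)       # node-specific memberships
        Y = np.zeros((n, n), dtype=int)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                z_send = rng.choice(K, p=pi[i])   # role i takes toward j
                z_recv = rng.choice(K, p=pi[j])   # role j takes toward i
                Y[i, j] = rng.binomial(1, B[z_send, z_recv])
        return pi, Y

    # Two roles with dense within-role, sparse between-role interaction
    # (B is an illustrative assumption; small alpha gives crisp memberships).
    B = np.array([[0.9, 0.05],
                  [0.05, 0.9]])
    pi, Y = sample_mmsb(n=30, alpha=[0.1, 0.1], B=B)
    ```

    Because each pair re-draws its roles, a node can act as a member of different blocks in different interactions, which is exactly what distinguishes mixed membership from a classical single-membership blockmodel.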

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
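    One of the five challenges, class imbalance, admits a very simple baseline remedy that makes the problem concrete: randomly oversample the minority class before model fitting. This is an illustrative sketch of that one tactic, not a summary of the strategies the review surveys.

    ```python
    import random

    def oversample_minority(X, y, seed=0):
        """Random oversampling: duplicate minority-class rows until every
        class matches the size of the largest one. A simple baseline for
        the class-imbalance challenge (more refined ML-based strategies
        exist, e.g. synthetic sampling or cost-sensitive losses)."""
        rng = random.Random(seed)
        by_class = {}
        for xi, yi in zip(X, y):
            by_class.setdefault(yi, []).append(xi)
        n_max = max(len(rows) for rows in by_class.values())
        Xb, yb = [], []
        for label, rows in by_class.items():
            extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
            Xb.extend(rows + extra)
            yb.extend([label] * n_max)
        return Xb, yb

    # Imbalanced toy cohort: 9 controls (label 0) vs 3 cases (label 1).
    X = [[i] for i in range(12)]
    y = [0] * 9 + [1] * 3
    Xb, yb = oversample_minority(X, y)
    ```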

    Distributed Estimation and Inference for the Analysis of Big Biomedical Data

    This thesis focuses on developing and implementing new statistical methods to address some of the current difficulties encountered in the analysis of high-dimensional correlated biomedical data. Following the divide-and-conquer paradigm, I develop a theoretically sound and computationally tractable class of distributed statistical methods that are made accessible to practitioners through R statistical software. This thesis aims to establish a class of distributed statistical methods for regression analyses with very large outcome variables arising in many biomedical fields, such as in metabolomic or imaging research. The general distributed procedure divides data into blocks that are analyzed on a parallelized computational platform and combines these separate results via Hansen’s (1982) generalized method of moments. These new methods provide distributed and efficient statistical inference in many different regression settings. Computational efficiency is achieved by leveraging recent developments in large scale computing, such as the MapReduce paradigm on the Hadoop platform. In the first project presented in Chapter III, I develop a divide-and-conquer procedure implemented in a parallelized computational scheme for statistical estimation and inference of regression parameters with high-dimensional correlated responses. This project is motivated by an electroencephalography study whose goal is to determine the effect of iron deficiency on infant auditory recognition memory. The proposed method (published as Hector and Song (2020a)), the Distributed and Integrated Method of Moments (DIMM), divides responses into subvectors to be analyzed in parallel using pairwise composite likelihood, and combines results using an optimal one-step meta-estimator. 
In the second project presented in Chapter IV, I develop an extended theoretical framework of distributed estimation and inference to incorporate a broad range of classical statistical models and biomedical data types. To reduce computation time and meet data privacy demands, I propose to divide data by outcomes and subjects, leading to a doubly divide-and-conquer paradigm. I also address parameter heterogeneity explicitly for added flexibility. I establish a new theoretical framework for the analysis of a broad class of big data problems to facilitate valid statistical inference for biomedical researchers. Possible applications include genomic data, metabolomic data, longitudinal and spatial data, and many more. In the third project presented in Chapter V, I propose a distributed quadratic inference function framework to jointly estimate regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. This project is motivated by the analysis of the association between smoking and metabolites in a large cohort study. The primary goal of this joint integrative analysis is to estimate covariate effects on all outcomes through a marginal regression model in a statistically and computationally efficient way. To overcome computational and modeling challenges arising from the high-dimensional likelihood of the correlated vector outcomes, I propose to analyze each data source using Qu et al.'s quadratic inference functions, and then to jointly re-estimate parameters from each data source by accounting for correlation between data sources.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163220/1/ehector_1.pd
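    The divide-and-conquer combination step can be illustrated in miniature: estimate a common parameter on each data block independently, then merge the block estimates with inverse-variance weights. This is a deliberately simplified sketch on a toy scalar mean, not the thesis's DIMM estimator, whose combiner is Hansen's (1982) generalized method of moments over block-level estimating functions.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def combine_blocks(estimates, variances):
        """Inverse-variance weighted meta-estimator: merge per-block
        estimates of a common scalar parameter into one efficient
        estimate. Illustrates the combine step of divide-and-conquer;
        each block could have been fit on a separate worker."""
        w = 1.0 / np.asarray(variances)
        return np.sum(w * np.asarray(estimates)) / np.sum(w)

    # Toy data: split one large sample into 10 blocks, estimate the mean
    # per block (the embarrassingly parallel step), then combine.
    theta = 3.0
    data = rng.normal(theta, 2.0, size=10_000)
    blocks = np.array_split(data, 10)
    ests = [b.mean() for b in blocks]
    vars_ = [b.var(ddof=1) / len(b) for b in blocks]   # estimated variance of each block mean
    theta_hat = combine_blocks(ests, vars_)
    ```

    With roughly equal block variances the combined estimate nearly matches the full-data estimate, which is the efficiency property the distributed methods above establish rigorously for far more general regression settings.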

    Parallel ant colony optimization for the training of cell signaling networks

    [Abstract]: Acquiring a functional comprehension of the deregulation of cell signaling networks in disease allows progress in the development of new therapies and drugs. Computational models are becoming increasingly popular as a systematic tool to analyze the functioning of complex biochemical networks, such as those involved in cell signaling. CellNOpt is a framework to build predictive logic-based models of signaling pathways by training a prior knowledge network to biochemical data obtained from perturbation experiments. This training can be formulated as an optimization problem that can be solved using metaheuristics. However, the genetic algorithm used so far in CellNOpt presents limitations in terms of execution time and quality of solutions when applied to large instances. Thus, in order to overcome those issues, in this paper we propose the use of a method based on ant colony optimization, adapted to the problem at hand and parallelized using a hybrid approach. The performance of this novel method is illustrated with several challenging benchmark problems in the study of new therapies for liver cancer.
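    Training a logic model amounts to deciding which edges of the prior knowledge network to keep, which encodes naturally as a bit string. This is a minimal serial sketch of ant colony optimization over bit strings with an invented toy fitness, not the paper's parallel hybrid method: each bit carries a pheromone level, ants sample candidate solutions from it, and pheromones are evaporated and reinforced toward the best solution found.

    ```python
    import random

    def aco_binary(fitness, n_bits, n_ants=20, n_iter=50, rho=0.1, seed=0):
        """Minimal ant colony optimization over bit strings. tau[k] plays
        the role of pheromone: the probability of setting bit k to 1.
        Each iteration, ants sample solutions from tau; pheromone then
        evaporates at rate rho and is deposited toward the global best."""
        rng = random.Random(seed)
        tau = [0.5] * n_bits                       # uniform initial pheromone
        best, best_fit = None, float("-inf")
        for _ in range(n_iter):
            ants = [[1 if rng.random() < t else 0 for t in tau]
                    for _ in range(n_ants)]
            champ = max(ants, key=fitness)         # iteration-best ant
            if fitness(champ) > best_fit:
                best, best_fit = champ, fitness(champ)
            # Evaporate, then reinforce toward the global-best bit string.
            tau = [(1 - rho) * t + rho * b for t, b in zip(tau, best)]
        return best, best_fit

    # Toy fitness (hypothetical stand-in for model-to-data fit): count
    # bits matching a hidden target edge-selection pattern.
    target = [1, 0, 1, 1, 0, 0, 1, 0]
    sol, fit = aco_binary(lambda b: sum(x == y for x, y in zip(b, target)),
                          n_bits=len(target))
    ```

    In CellNOpt-style training the fitness would instead score the mismatch between the logic model's predictions and the perturbation data; the paper's contribution lies in adapting and parallelizing this search, which the serial loop above only gestures at.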