
    Hierarchical Dirichlet Process-Based Models For Discovery of Cross-species Mammalian Gene Expression

    An important research problem in computational biology is the identification of expression programs, sets of co-activated genes orchestrating physiological processes, and the characterization of the functional breadth of these programs. The use of mammalian expression data compendia for discovery of such programs presents several challenges, including: 1) cellular inhomogeneity within samples, 2) genetic and environmental variation across samples, and 3) uncertainty in the numbers of programs and sample populations. We developed GeneProgram, a new unsupervised computational framework that uses expression data to simultaneously organize genes into overlapping programs and tissues into groups to produce maps of inter-species expression programs, which are sorted by generality scores that exploit the automatically learned groupings. Our method addresses each of the above challenges by using a probabilistic model that: 1) allocates mRNA to different expression programs that may be shared across tissues, 2) is hierarchical, treating each tissue as a sample from a population of related tissues, and 3) uses Dirichlet Processes, a non-parametric Bayesian method that provides prior distributions over numbers of sets while penalizing model complexity. Using real gene expression data, we show that GeneProgram outperforms several popular expression analysis methods in recovering biologically interpretable gene sets. From a large compendium of mouse and human expression data, GeneProgram discovers 19 tissue groups and 100 expression programs active in mammalian tissues. Our method automatically constructs a comprehensive, body-wide map of expression programs and characterizes their functional generality. This map can be used for guiding future biological experiments, such as discovery of genes for new drug targets that exhibit minimal "cross-talk" with unintended organs, or genes that maintain general physiological responses that go awry in disease states. Further, our method is general, and can be applied readily to novel compendia of biological data.
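    As a concrete illustration of the Dirichlet Process machinery this abstract leans on, the sketch below shows the truncated stick-breaking construction of DP mixture weights in Python. It is a generic textbook construction under illustrative parameter names, not GeneProgram's actual model.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking construction of Dirichlet process weights.

    alpha      -- concentration parameter; smaller values concentrate mass
                  on fewer components (here: fewer expression programs)
    truncation -- number of sticks in the finite approximation
    """
    betas = rng.beta(1.0, alpha, size=truncation)             # stick fractions
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * leftover                                   # mixture weights

rng = np.random.default_rng(0)
w = stick_breaking_weights(alpha=1.0, truncation=50, rng=rng)
print(w[:5], w.sum())   # weights decay; the sum approaches 1 as truncation grows
```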

    Childhood obesity in Singapore: A Bayesian nonparametric approach

    Overweight and obesity in adults are known to be associated with increased risk of metabolic and cardiovascular diseases. Obesity has now reached epidemic proportions, increasingly affecting children. Therefore, it is important to understand whether this condition persists from early life into childhood and whether different patterns can be detected to inform intervention policies. Our motivating application is a study of temporal patterns of obesity in children from South Eastern Asia. Our main focus is on clustering obesity patterns after adjusting for the effect of baseline information. Specifically, we consider a joint model for height and weight over time. Measurements are taken every six months from birth. To allow for data-driven clustering of trajectories, we assume a vector autoregressive sampling model with a dependent logit stick-breaking prior. Simulation studies show good performance of the proposed model in capturing overall growth patterns, as compared to other alternatives. We also fit the model to the motivating dataset and discuss the results, in particular highlighting cluster differences. We find four large clusters, corresponding to sub-groups of children, though two of them are similar in terms of both height and weight at each time point. We provide an interpretation of these clusters in terms of combinations of predictors.
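    The dependent logit stick-breaking prior mentioned above makes the mixture weights functions of baseline covariates, so cluster allocation can adjust for baseline information. A minimal generic sketch of that weight construction follows; the names and shapes are illustrative assumptions, not the authors' exact specification.

```python
import numpy as np

def logit_stick_breaking_weights(X, coefs):
    """Covariate-dependent mixture weights from a logit stick-breaking prior.

    X     -- (n, p) matrix of baseline covariates
    coefs -- (K-1, p) logistic coefficients, one row per stick
    Returns an (n, K) matrix: each subject gets its own weight vector, so
    cluster allocation can depend on baseline information.
    """
    v = 1.0 / (1.0 + np.exp(-(X @ coefs.T)))    # (n, K-1) stick fractions
    n, K1 = v.shape
    w = np.zeros((n, K1 + 1))
    leftover = np.ones(n)
    for k in range(K1):
        w[:, k] = v[:, k] * leftover
        leftover = leftover * (1.0 - v[:, k])
    w[:, -1] = leftover                          # last weight absorbs the rest
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                      # five subjects, three covariates
W = logit_stick_breaking_weights(X, rng.normal(size=(4, 3)))   # K = 5 clusters
print(W.sum(axis=1))                             # each row sums to 1
```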

    Bayesian Latent Variable Methods for Longitudinal Processes with Applications to Fetal Growth

    We consider methods for joint models of exposure and response in epidemiologic studies. In particular, we show how latent variable methods provide a structure for obtaining inference about multistate growth processes and multiple longitudinal and cross-sectional outcomes. Each model utilizes underlying, subject-specific latent variables to account for the correlation that arises from taking multiple observations on the same sampling unit. We also consider latent variable mixture models in order to more flexibly model the latent variable distributions and identify latent classes of subjects who are of particular scientific importance. We apply our methods to applications in reproductive health, obtaining interesting new insights while developing and applying statistical methodology. We first consider the problem of estimating a multistate growth process with unknown initiation time to determine individual early fetal growth. Using cross-sectional data, we identify fetuses that have a latent tendency to grow relatively quickly and slowly and show that slow growth early in pregnancy is associated with an increased risk of future pregnancy loss. These results are important to researchers who use early ultrasounds to date pregnancies under the assumption that there is no measurable variability in early fetal growth. Paper two is concerned with jointly modeling the unusual, asymmetric distributions of birth weight and gestational age. Using latent variable mixture models, we identify a latent class of subjects who are more likely to deliver early and have low weight. We also allow observed covariates to be associated with latent class membership. Our approach provides researchers a new method for examining low birth weight and pre-term birth. In paper three, we aggregate multiple ultrasound measurements on fetal size and blood flow restriction using latent variables that follow mixture distributions to identify a latent class of subjects who are growth restricted during pregnancy. We then consider a joint model that examines the associations between covariates, early growth restriction, and outcomes measured at birth. Our methods are able to identify a latent class of subjects who have increased blood flow restriction and below average intrauterine size during the second trimester who are more likely to be growth restricted at birth.
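    To fix ideas for paper two, the sketch below shows a two-component normal mixture for birth weight with a small latent class of early, low-weight deliveries. All parameter values are illustrative assumptions, not estimates from the paper.

```python
import numpy as np
from scipy.stats import norm

def birthweight_density(x, pi=0.10, mu=(2.2, 3.4), sd=(0.60, 0.45)):
    """Two-component normal mixture: a small latent class of early, low-weight
    deliveries (weight pi) plus a dominant term class.  The asymmetry of the
    overall density comes from the minority component's lower mean."""
    return pi * norm.pdf(x, mu[0], sd[0]) + (1 - pi) * norm.pdf(x, mu[1], sd[1])

grid = np.linspace(0.5, 5.5, 11)        # birth weight in kg
print(birthweight_density(grid))        # left-skewed mixture density
```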

    Nonparametric Bayesian analysis of some clustering problems

    Nonparametric Bayesian models have been researched extensively in the past 10 years following the work of Escobar and West (1995) on sampling schemes for Dirichlet processes. The infinite mixture representation of the Dirichlet process makes it useful for clustering problems where the number of clusters is unknown. We develop nonparametric Bayesian models for two different clustering problems, namely functional and graphical clustering. We propose a nonparametric Bayes wavelet model for clustering of functional or longitudinal data. The wavelet modelling is aimed at the resolution of global and local features during clustering. The model also allows the elicitation of prior belief about the regularity of the functions and has the ability to adapt to a wide range of functional regularity. Posterior inference is carried out by Gibbs sampling with conjugate priors for fast computation. We use simulated as well as real datasets to illustrate the suitability of the approach over other alternatives. The functional clustering model is extended to analyze splice microarray data. New microarray technologies probe consecutive segments along genes to observe alternative splicing (AS) mechanisms that produce multiple proteins from a single gene. Clues regarding the number of splice forms can be obtained by clustering the functional expression profiles from different tissues. The analysis was carried out on the Rosetta dataset (Johnson et al., 2003) to obtain a splice variant by tissue distribution for all the 10,000 genes. We were able to identify a number of splice forms that appear to be unique to cancer. We propose a Bayesian model for partitioning graphs depicting dependencies in a collection of objects. After suitable transformations and modelling techniques, the problem of graph cutting can be approached by nonparametric Bayes clustering. We draw motivation from a recent work (Dhillon, 2001) showing the equivalence of kernel k-means clustering and certain graph cutting algorithms. It is shown that loss functions similar to the kernel k-means naturally arise in this model, and the minimization of associated posterior risk comprises an effective graph cutting strategy. We present here results from the analysis of two microarray datasets, namely the melanoma dataset (Bittner et al., 2000) and the sarcoma dataset (Nykter et al., 2006).
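    The "unknown number of clusters" property invoked here comes from the partition distribution a Dirichlet process induces, the Chinese restaurant process. Below is a minimal generic sketch of drawing such a partition, unrelated to the wavelet or graph-cutting specifics of the thesis.

```python
import numpy as np

def crp_partition(n, alpha, rng):
    """Draw a random partition of n items from the Chinese restaurant process,
    the partition distribution induced by a Dirichlet process; the number of
    clusters is random rather than fixed in advance."""
    labels = np.zeros(n, dtype=int)
    counts = [1]                                  # item 0 opens table 0
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)   # existing table or a new one
        if table == len(counts):
            counts.append(1)                      # new cluster
        else:
            counts[table] += 1
        labels[i] = table
    return labels

rng = np.random.default_rng(2)
print(crp_partition(20, alpha=1.0, rng=rng))      # e.g. [0 0 1 0 2 ...]
```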

    Computational discovery of gene modules, regulatory networks and expression programs

    Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2007. Includes bibliographical references (p. 163-181). High-throughput molecular data are revolutionizing biology by providing massive amounts of information about gene expression and regulation. Such information is applicable both to furthering our understanding of fundamental biology and to developing new diagnostic and treatment approaches for diseases. However, novel mathematical methods are needed for extracting biological knowledge from high-dimensional, complex and noisy data sources. In this thesis, I develop and apply three novel computational approaches for this task. The common theme of these approaches is that they seek to discover meaningful groups of genes, which confer robustness to noise and compress complex information into interpretable models. I first present the GRAM algorithm, which fuses information from genome-wide expression and in vivo transcription factor-DNA binding data to discover regulatory networks of gene modules. I use the GRAM algorithm to discover regulatory networks in Saccharomyces cerevisiae, including rich media, rapamycin, and cell-cycle module networks. I use functional annotation databases, independent biological experiments and DNA-motif information to validate the discovered networks, and to show that they yield new biological insights. Second, I present GeneProgram, a framework based on Hierarchical Dirichlet Processes, which uses large compendia of mammalian expression data to simultaneously organize genes into overlapping programs and tissues into groups to produce maps of expression programs. I demonstrate that GeneProgram outperforms several popular analysis methods, and using mouse and human expression data, show that it automatically constructs a comprehensive, body-wide map of inter-species expression programs. Finally, I present an extension of GeneProgram that models temporal dynamics. I apply the algorithm to a compendium of short time-series gene expression experiments in which human cells were exposed to various infectious agents. I show that discovered expression programs exhibit temporal pattern usage differences corresponding to classes of host cells and infectious agents, and describe several programs that implicate surprising signaling pathways and receptor types in human responses to infection. By Georg Kurt Gerber. Ph.D.

    Proceedings of the 35th International Workshop on Statistical Modelling : July 20- 24, 2020 Bilbao, Basque Country, Spain

    466 p. The International Workshop on Statistical Modelling (IWSM) is a reference workshop for promoting statistical modelling and applications of statistics, in a broad sense, among researchers, academics and industrialists. Unfortunately, the global COVID-19 pandemic did not allow holding the 35th edition of the IWSM in Bilbao in July 2020. Despite the situation, and following the spirit of the Workshop and the Statistical Modelling Society, we are delighted to bring you this proceedings book of extended abstracts.

    On Dependent Processes in Bayesian Nonparametrics: Theory, Methods, and Applications

    The main topics of the thesis are dependent processes and their uses in Bayesian nonparametric statistics. With the term dependent processes, we refer to two or more infinite dimensional random objects, i.e., random probability measures, completely random measures, and random partitions, whose joint probability law does not factorize and, thus, encodes non-trivial dependence. We investigate properties and limits of existing nonparametric dependent priors and propose new dependent processes that fill gaps in the existing literature. To do so, we first define a class of priors, namely multivariate species sampling processes, which encompasses many dependent processes used in Bayesian nonparametrics. We derive a series of theoretical results for the priors within this class, keeping as main focus the dependence induced between observations as well as between random probability measures. Then, in light of our theoretical findings, as well as considering specific motivating applications, we develop novel prior processes outside this class, enlarging the types of data structures and prior information that can be handled by the Bayesian nonparametric approach. We propose three new classes of dependent processes: full-range borrowing of information priors, invariant dependent priors (with a focus on symmetric hierarchical Dirichlet processes), and dependent priors for panel count data. Full-range borrowing of information priors are dependent random probability measures that may induce either positive or negative correlation across observations and, thus, they achieve high flexibility in the type of induced dependence. Moreover, they introduce an innovative idea of borrowing of information across samples which differs from classical shrinkage. Invariant dependent priors are instead dependent random probabilities that almost surely satisfy a specified invariance condition, e.g., symmetry. They may be employed either when a priori knowledge on the shape of the unknown distribution is available or, as we do, to flexibly model error terms in complex models without losing identifiability of other parameters of interest. Finally, dependent priors for panel count data are flexible priors based on completely random measures that take into account dependence between the observed counts and the frequency of observation in panel count data studies. We study a priori and a posteriori properties of all the proposed models, develop algorithms to derive inference, compare the performances of our proposals with existing methods, and apply these constructions to simulated and real datasets. Throughout the thesis, we try to balance theoretical and methodological results with real-world applications.
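    For intuition about dependence between random probability measures, the sketch below uses the simplest textbook device: a convex combination of a shared and an idiosyncratic component. It is emphatically not the thesis's full-range construction; the docstring notes the limitation those priors are designed to remove.

```python
import numpy as np

def dependent_probability_vectors(w, dim, n_groups, rng):
    """Convex-combination device for dependent random probability vectors:
    p_j = w * shared + (1 - w) * idiosyncratic_j.  Dependence across groups
    grows with w.  This simple construction can only induce non-negative
    correlation, which is exactly the restriction that full-range borrowing
    of information priors overcome."""
    shared = rng.dirichlet(np.ones(dim))
    return [w * shared + (1 - w) * rng.dirichlet(np.ones(dim))
            for _ in range(n_groups)]

rng = np.random.default_rng(3)
p1, p2 = dependent_probability_vectors(w=0.8, dim=50, n_groups=2, rng=rng)
print(np.corrcoef(p1, p2)[0, 1])   # close to 1 as w approaches 1
```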

    Nonparametric Bayes methods for high dimensional data and group sequential design for longitudinal trials

    High-dimensional unordered categorical data appear in a number of areas, including epidemiology and the behavioral and social sciences. Such data can be placed into a large contingency table, with cell counts defined as the number of subjects with a given combination of variable values. In practice the contingency table is often sparse, in the sense that only a few cells have more than a few counts, with most cells being empty. Traditional approaches for contingency table analysis fail to scale up to moderate dimensions, and alternative approaches based on tensor decomposition are promising. This motivates us to develop sparse tensor decompositions for multivariate categorical variables, where the number of variables can potentially be larger than the sample size. The methods are shown to have excellent performance in simulations, and results on various data sets are presented. In paper 2, we consider such high-dimensional data in case-control studies, with the main goal being detection of the sparse subset of predictors having a significant association with disease. We propose a new approach based on a nonparametric Bayesian low rank tensor factorization to model the retrospective likelihood. Our model allows a very flexible structure in characterizing the distribution of multivariate variables as unknown and without any linearity assumptions as in logistic regression. Predictors are excluded only if they have no impact on disease risk, either directly or through interactions with other predictors. Hence, we obtain an omnibus approach for screening for important predictors. Computation relies on an efficient Gibbs sampler. The methods are shown to have higher power and lower false discovery rates in simulation studies relative to existing methods, and we consider an application to an epidemiologic study of birth defects. In paper 3, our goal is to design a longitudinal trial using a group sequential design. We propose an information-based sample size re-estimation method to update the sample size at each interim analysis, which maintains the target power while controlling the type-I error rate. We illustrate our strategy with data analysis examples and simulations, and compare the results with those obtained using a fixed design and a group sequential design without sample size re-estimation. Doctor of Philosophy.
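    For paper 3, the standard information-based update re-estimates the nuisance variance at an interim look and re-solves the power equation. Below is a minimal sketch for a two-arm comparison of means, assuming a two-sided level-alpha z-test; this is the generic formula, not necessarily the authors' exact procedure.

```python
import math
from scipy.stats import norm

def reestimated_n_per_group(delta, sigma_hat, alpha=0.05, power=0.90):
    """Information-based sample size re-estimation for a two-arm comparison
    of means.  At an interim look the nuisance variance is re-estimated and
    the per-group sample size is updated so the trial still reaches the
    information level required for the target power."""
    z_a = norm.ppf(1 - alpha / 2)               # two-sided level alpha
    z_b = norm.ppf(power)
    info_target = ((z_a + z_b) / delta) ** 2    # required Fisher information
    return math.ceil(2 * sigma_hat**2 * info_target)

# Planned with sigma = 1.0; interim data suggest sigma is closer to 1.3,
# so the required per-group sample size grows by a factor of ~1.69.
print(reestimated_n_per_group(delta=0.5, sigma_hat=1.0))   # 85
print(reestimated_n_per_group(delta=0.5, sigma_hat=1.3))   # 143
```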