52,630 research outputs found

    Diverse correlation structures in gene expression data and their utility in improving statistical inference

    Full text link
    It is well known that correlations in microarray data represent a serious nuisance deteriorating the performance of gene selection procedures. This paper is intended to demonstrate that the correlation structure of microarray data provides a rich source of useful information. We discuss distinct correlation substructures revealed in microarray gene expression data by an appropriate ordering of genes. These substructures include stochastic proportionality of expression signals in a large percentage of all gene pairs, negative correlations hidden in ordered gene triples, and a long sequence of weakly dependent random variables associated with ordered pairs of genes. The reported striking regularities are of general biological interest and they also have far-reaching implications for theory and practice of statistical methods of microarray data analysis. We illustrate the latter point with a method for testing differential expression of nonoverlapping gene pairs. While designed for testing a different null hypothesis, this method provides an order of magnitude more accurate control of type 1 error rate compared to conventional methods of individual gene expression profiling. In addition, this method is robust to the technical noise. Quantitative inference of the correlation structure has the potential to extend the analysis of microarray data far beyond currently practiced methods.Comment: Published in at http://dx.doi.org/10.1214/07-AOAS120 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Inferential stability in systems biology

    Get PDF
    The modern biological sciences are fraught with statistical difficulties. Biomolecular stochasticity, experimental noise, and the “large p, small n” problem all contribute to the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful conclusions from observations. In this thesis, we explore methods for assessing the effects of data variability upon downstream inference, in an attempt to quantify and promote the stability of the inferences we make. We start with a review of existing methods for addressing this problem, focusing upon the bootstrap and similar methods. The key requirement for all such approaches is a statistical model that approximates the data generating process. We move on to consider biomarker discovery problems. We present a novel algorithm for proposing putative biomarkers on the strength of both their predictive ability and the stability with which they are selected. In a simulation study, we find our approach to perform favourably in comparison to strategies that select on the basis of predictive performance alone. We then consider the real problem of identifying protein peak biomarkers for HAM/TSP, an inflammatory condition of the central nervous system caused by HTLV-1 infection. We apply our algorithm to a set of SELDI mass spectral data, and identify a number of putative biomarkers. Additional experimental work, together with known results from the literature, provides corroborating evidence for the validity of these putative biomarkers. Having focused on static observations, we then make the natural progression to time course data sets. We propose a (Bayesian) bootstrap approach for such data, and then apply our method in the context of gene network inference and the estimation of parameters in ordinary differential equation models. We find that the inferred gene networks are relatively unstable, and demonstrate the importance of finding distributions of ODE parameter estimates, rather than single point estimates

    Network estimation in State Space Model with L1-regularization constraint

    Full text link
    Biological networks have arisen as an attractive paradigm of genomic science ever since the introduction of large scale genomic technologies which carried the promise of elucidating the relationship in functional genomics. Microarray technologies coupled with appropriate mathematical or statistical models have made it possible to identify dynamic regulatory networks or to measure time course of the expression level of many genes simultaneously. However one of the few limitations fall on the high-dimensional nature of such data coupled with the fact that these gene expression data are known to include some hidden process. In that regards, we are concerned with deriving a method for inferring a sparse dynamic network in a high dimensional data setting. We assume that the observations are noisy measurements of gene expression in the form of mRNAs, whose dynamics can be described by some unknown or hidden process. We build an input-dependent linear state space model from these hidden states and demonstrate how an incorporated L1L_{1} regularization constraint in an Expectation-Maximization (EM) algorithm can be used to reverse engineer transcriptional networks from gene expression profiling data. This corresponds to estimating the model interaction parameters. The proposed method is illustrated on time-course microarray data obtained from a well established T-cell data. At the optimum tuning parameters we found genes TRAF5, JUND, CDK4, CASP4, CD69, and C3X1 to have higher number of inwards directed connections and FYB, CCNA2, AKT1 and CASP8 to be genes with higher number of outwards directed connections. We recommend these genes to be object for further investigation. Caspase 4 is also found to activate the expression of JunD which in turn represses the cell cycle regulator CDC2.Comment: arXiv admin note: substantial text overlap with arXiv:1308.359

    Sparse reduced-rank regression for imaging genetics studies: models and applications

    Get PDF
    We present a novel statistical technique; the sparse reduced rank regression (sRRR) model which is a strategy for multivariate modelling of high-dimensional imaging responses and genetic predictors. By adopting penalisation techniques, the model is able to enforce sparsity in the regression coefficients, identifying subsets of genetic markers that best explain the variability observed in subsets of the phenotypes. To properly exploit the rich structure present in each of the imaging and genetics domains, we additionally propose the use of several structured penalties within the sRRR model. Using simulation procedures that accurately reflect realistic imaging genetics data, we present detailed evaluations of the sRRR method in comparison with the more traditional univariate linear modelling approach. In all settings considered, we show that sRRR possesses better power to detect the deleterious genetic variants. Moreover, using a simple genetic model, we demonstrate the potential benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to extracting averages over regions of interest in the brain. Since this entails the use of phenotypic vectors of enormous dimensionality, we suggest the use of a sparse classification model as a de-noising step, prior to the imaging genetics study. Finally, we present the application of a data re-sampling technique within the sRRR model for model selection. Using this approach we are able to rank the genetic markers in order of importance of association to the phenotypes, and similarly rank the phenotypes in order of importance to the genetic markers. In the very end, we illustrate the application perspective of the proposed statistical models in three real imaging genetics datasets and highlight some potential associations

    Data-driven modelling of biological multi-scale processes

    Full text link
    Biological processes involve a variety of spatial and temporal scales. A holistic understanding of many biological processes therefore requires multi-scale models which capture the relevant properties on all these scales. In this manuscript we review mathematical modelling approaches used to describe the individual spatial scales and how they are integrated into holistic models. We discuss the relation between spatial and temporal scales and the implication of that on multi-scale modelling. Based upon this overview over state-of-the-art modelling approaches, we formulate key challenges in mathematical and computational modelling of biological multi-scale and multi-physics processes. In particular, we considered the availability of analysis tools for multi-scale models and model-based multi-scale data integration. We provide a compact review of methods for model-based data integration and model-based hypothesis testing. Furthermore, novel approaches and recent trends are discussed, including computation time reduction using reduced order and surrogate models, which contribute to the solution of inference problems. We conclude the manuscript by providing a few ideas for the development of tailored multi-scale inference methods.Comment: This manuscript will appear in the Journal of Coupled Systems and Multiscale Dynamics (American Scientific Publishers

    Iterative Random Forests to detect predictive and stable high-order interactions

    Get PDF
    Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology
    corecore