
    A hidden spatial-temporal Markov random field model for network-based analysis of time course gene expression data

    Microarray time course (MTC) gene expression data are commonly collected to study the dynamic nature of biological processes. One important problem is to identify genes that show different expression profiles over time and pathways that are perturbed during a given biological process. While methods are available to identify genes with differential expression levels over time, there is a lack of methods that can incorporate pathway information when identifying the pathways being modified/activated during a biological process. In this paper we develop a hidden spatial-temporal Markov random field (hstMRF)-based method for identifying genes and subnetworks that are related to biological processes, where the dependency of the differential expression patterns of genes is modeled over time and over the network of pathways. Simulation studies indicated that the method is quite effective in identifying genes and modified subnetworks and has higher sensitivity than the commonly used procedures that do not use the pathway structure or time dependency information, with similar false discovery rates. Application to a microarray gene expression study of systemic inflammation in humans identified a core set of genes on the KEGG pathways that show clear differential expression patterns over time. In addition, the method confirmed that the TOLL-like signaling pathway plays an important role in immune response to endotoxins. (Comment: Published at http://dx.doi.org/10.1214/07-AOAS145 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).)
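The core idea of the abstract above, letting a gene's differential-expression call borrow strength from its pathway neighbors, can be illustrated with a minimal Markov-random-field label-smoothing step. This is a simplified iterated-conditional-modes sketch, not the paper's full hidden spatial-temporal model; the gene names, evidence scores, coupling parameter `beta`, and threshold are invented for illustration.

```python
# Minimal MRF label smoothing on a gene network (ICM-style sketch).
# Each gene has an evidence score (e.g. a z-statistic for differential
# expression); labels are smoothed toward agreement with pathway neighbors.

def icm_smooth(scores, edges, beta=1.0, threshold=1.0, iters=10):
    """Iterated conditional modes on a binary MRF.

    scores: {gene: evidence score}; edges: list of (gene, gene) pairs;
    beta: strength of neighbor agreement; threshold: evidence cutoff.
    """
    neighbors = {g: [] for g in scores}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # Initialize labels from the evidence alone (1 = differentially expressed).
    labels = {g: int(s > threshold) for g, s in scores.items()}
    for _ in range(iters):
        changed = False
        for g in scores:
            # Local decision: evidence term plus neighbor-agreement term.
            support = sum(2 * labels[n] - 1 for n in neighbors[g])
            new = int(scores[g] - threshold + beta * support > 0)
            if new != labels[g]:
                labels[g], changed = new, True
        if not changed:
            break
    return labels

# A chain of genes: gene B has weak evidence on its own but is rescued by
# two strongly differential neighbors; gene D stays non-differential.
scores = {"A": 2.5, "B": 0.8, "C": 2.2, "D": -1.0}
edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(icm_smooth(scores, edges))
```

Note how B flips to "differential" only because of its neighbors; thresholding each gene in isolation would miss it, which is the borrowing-of-strength effect the network model exploits.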

    Nonparametric false discovery rate control for identifying simultaneous signals

    It is frequently of interest to jointly analyze multiple sequences of multiple tests in order to identify simultaneous signals, defined as features tested in multiple studies whose test statistics are non-null in each. In many problems, however, the null distributions of the test statistics may be complicated or even unknown, and no existing procedures can be employed in these cases. This paper proposes a new nonparametric procedure that can identify simultaneous signals across multiple studies even without knowing the null distributions of the test statistics. The method is shown to asymptotically control the false discovery rate, and in simulations it had excellent power and error control. In an analysis of gene expression and histone acetylation patterns in the brains of mice exposed to a conspecific intruder, it identified genes that were both differentially expressed and next to differentially accessible chromatin. The proposed method is available in the R package at github.com/sdzhao/ssa.
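A common parametric baseline for this setting, useful for contrast with the nonparametric procedure described above, is to note that a feature can be a simultaneous signal only if it is significant in every study: combine each feature's p-values across studies via the maximum and apply Benjamini-Hochberg to the combined values. This conservative baseline is a sketch, not the paper's method, and the p-values below are made up.

```python
# Baseline simultaneous-signal detection: a feature qualifies only if it is
# significant in every study, so take the maximum p-value across studies and
# control the FDR on the combined values with Benjamini-Hochberg.

def benjamini_hochberg(pvals, alpha=0.05):
    """Return the set of indices rejected at FDR level alpha (BH step-up)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears the BH line
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])

def simultaneous_signals(p_study1, p_study2, alpha=0.05):
    # max across studies: both studies must show a small p-value.
    combined = [max(a, b) for a, b in zip(p_study1, p_study2)]
    return benjamini_hochberg(combined, alpha)

p1 = [0.001, 0.002, 0.40, 0.03]
p2 = [0.004, 0.55, 0.001, 0.01]
print(sorted(simultaneous_signals(p1, p2)))
```

Note that this baseline requires valid p-values, i.e. known null distributions, which is exactly the assumption the paper's nonparametric procedure removes.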

    A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining

    Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category of the bigness taxonomy it falls into. Large p, small n data sets, for instance, require a different set of tools from the large n, small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication, Sequentialization. Indeed, it is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress the fact that simplicity, in the sense of Ockham's razor and its non-plurality principle of parsimony, tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets. (Comment: 18 pages, 2 figures, 3 tables)
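The n-versus-p characterization described in the abstract above can be made concrete with a small dispatcher. The category names, the cutoff on n, and the tool suggestions in the comments are illustrative stand-ins, not the paper's taxonomy.

```python
# Toy dispatcher for the n-vs-p characterization of data bigness.
# Category names, cutoffs, and the suggested tools are illustrative
# choices, not a definitive mapping.

def bigness_category(n, p):
    """Classify a data set by sample size n and input dimensionality p."""
    if p > n:
        return "large p, small n"  # e.g. regularization, selection, projection
    if n > 100_000:
        return "large n"           # e.g. compression, aggregation, parallelization
    return "classical"             # standard tools apply

print(bigness_category(50, 10_000))     # microarray-style data
print(bigness_category(5_000_000, 20))  # tall data
```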

    Listen to genes: dealing with microarray data in the frequency domain

    Background: We present a novel and systematic approach to analyze temporal microarray data. The approach includes normalization, clustering and network analysis of genes. Methodology: Genes are normalized using an error-model-based uniform normalization method aimed at identifying and estimating the sources of variation. The model minimizes the correlation among error terms across replicates. The normalized gene expressions are then clustered in terms of their power spectrum density. The method of complex Granger causality is introduced to reveal interactions between sets of genes. Complex Granger causality, along with partial Granger causality, is applied in both time and frequency domains to selected as well as all the genes to reveal interesting networks of interactions. The approach is successfully applied to Arabidopsis leaf microarray data generated from 31,000 genes observed at 22 time points over 22 days. Three circuits are analyzed in detail: a circadian gene circuit, an ethylene circuit, and a new global circuit showing a hierarchical structure that determines the initiators of leaf senescence. Conclusions: We use a totally data-driven approach to form biological hypotheses. Clustering using power-spectrum analysis helps us identify genes of potential interest. Their dynamics can be captured accurately in the time and frequency domains using the methods of complex and partial Granger causality. With the rise in availability of temporal microarray data, such methods can be useful tools in uncovering hidden biological interactions. We show our method step by step with the help of toy models as well as a real biological dataset. We also analyze three distinct gene circuits of potential interest to Arabidopsis researchers.
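The Granger-causality idea underlying the abstract above can be sketched in its ordinary bivariate, time-domain form: x "Granger-causes" y if adding lagged values of x to an autoregression of y on its own past reduces the residual variance. This is a simplified stand-in for the complex and partial Granger causality used in the paper; the synthetic series and lag order below are invented for illustration.

```python
import numpy as np

# Ordinary bivariate Granger causality (time domain): compare the residual
# variance of y regressed on its own lags (restricted) against y regressed
# on its own lags plus lags of x (full). A positive log-ratio suggests x
# helps predict y beyond y's own history.

def granger_index(x, y, lag=2):
    rows_r, rows_f, target = [], [], []
    for t in range(lag, len(y)):
        rows_r.append(y[t - lag:t])                             # y lags only
        rows_f.append(np.concatenate([y[t - lag:t], x[t - lag:t]]))  # + x lags
        target.append(y[t])
    A_r, A_f, b = np.asarray(rows_r), np.asarray(rows_f), np.asarray(target)
    res_r = b - A_r @ np.linalg.lstsq(A_r, b, rcond=None)[0]
    res_f = b - A_f @ np.linalg.lstsq(A_f, b, rcond=None)[0]
    return float(np.log(res_r.var() / res_f.var()))

# Synthetic example: y is driven by lagged x, with no feedback.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.standard_normal()

print(granger_index(x, y))  # clearly positive: x -> y
print(granger_index(y, x))  # near zero: no feedback
```

The frequency-domain variants in the paper decompose this same predictability gain across frequencies, which is what lets power-spectrum clusters and causal structure be examined together.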

    Random Forests: some methodological insights

    This paper examines, from an experimental perspective, random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims at confirming known but sparse advice for using random forests and at proposing some complementary remarks, both for standard problems and for high-dimensional ones in which the number of variables hugely exceeds the sample size. But the main contribution of this paper is twofold: to provide some insights into the behavior of the variable importance index based on random forests and, in addition, to investigate two classical variable selection problems. The first is to find important variables for interpretation; the second is more restrictive and tries to design a good prediction model. The strategy involves a ranking of explanatory variables using the random forests importance score and a stepwise ascending variable introduction strategy.
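The selection strategy described above, rank variables by importance and introduce them one at a time, can be sketched as follows. The importance scores and the error oracle here are toy stand-ins; in the paper the ranking comes from the random-forests importance index and the error from out-of-bag estimates.

```python
# Sketch of a stepwise ascending variable-introduction strategy: rank
# variables by an importance score, then add them one at a time, keeping
# a variable only if it lowers the prediction error.

def stepwise_ascending(importances, error_of, tol=0.0):
    """importances: {var: score}; error_of: callable(set of vars) -> error."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    selected = set()
    best = error_of(selected)
    for var in ranked:
        err = error_of(selected | {var})
        if err < best - tol:  # keep the variable only if error drops
            selected.add(var)
            best = err
    return selected, best

# Toy error oracle: only x1 and x3 are truly informative; x2 is a
# high-importance but redundant variable that gets filtered out.
def toy_error(vars_):
    return 1.0 - 0.4 * ("x1" in vars_) - 0.3 * ("x3" in vars_)

imp = {"x1": 0.9, "x2": 0.5, "x3": 0.4, "x4": 0.1}
print(stepwise_ascending(imp, toy_error))
```

The `tol` parameter mirrors the practical need to ignore tiny error fluctuations when deciding whether a newly introduced variable genuinely helps prediction.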