1,704 research outputs found

    RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs

    Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this paper, we provide theoretical foundations on the power and robustness for the model-free knockoffs procedure introduced recently in Candès, Fan, Janson and Lv (2016) in the high-dimensional setting when the covariate distribution is characterized by a Gaussian graphical model. We establish that under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as sample size goes to infinity. When moving away from the ideal case, we suggest the modified model-free knockoffs method called graphical nonlinear knockoffs (RANK) to accommodate the unknown covariate distribution. We provide theoretical justifications on the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power for the knockoffs procedure. Simulation results demonstrate that compared to existing approaches, our method performs competitively in both FDR control and power. A real data set is analyzed to further assess the performance of the suggested knockoffs procedure. Comment: 37 pages, 6 tables, 9 pages supplementary material
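
The abstract above refers to FDR control at a target level without showing the selection step. The following is a minimal sketch of the data-dependent threshold commonly used by knockoff procedures (the "knockoff+" rule from the knockoff filter literature), assuming the feature statistics W_j have already been computed. It does not construct knockoff variables or estimate the covariate distribution, which is the part RANK addresses; the function names and the example values are hypothetical.

```python
def knockoff_threshold(w, q, offset=1):
    """Return the knockoff selection threshold for statistics w at FDR level q.

    w      : list of statistics W_j (large positive values suggest a signal)
    q      : target FDR level, e.g. 0.2
    offset : 1 gives the conservative "knockoff+" rule, 0 the original rule
    """
    # Candidate thresholds are the distinct non-zero magnitudes of W_j.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # features that would be selected
        if (offset + negatives) / max(positives, 1) <= q:
            return t
    # No threshold meets the target level: select nothing.
    return float("inf")


def knockoff_select(w, q, offset=1):
    """Return the indices j with W_j at or above the threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics for ten features.
w = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 2.0, 0.3, -0.2, 1.5]
print(knockoff_threshold(w, q=0.2))
print(knockoff_select(w, q=0.2))
```

The count of large negative statistics stands in for the unknown number of false positives, which is why the sign symmetry of null W_j matters for the FDR guarantee.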

    Nonparametric false discovery rate control for identifying simultaneous signals

    It is frequently of interest to jointly analyze multiple sequences of multiple tests in order to identify simultaneous signals, defined as features tested in multiple studies whose test statistics are non-null in each. In many problems, however, the null distributions of the test statistics may be complicated or even unknown, and there do not currently exist any procedures that can be employed in these cases. This paper proposes a new nonparametric procedure that can identify simultaneous signals across multiple studies even without knowing the null distributions of the test statistics. The method is shown to asymptotically control the false discovery rate, and in simulations had excellent power and error control. In an analysis of gene expression and histone acetylation patterns in the brains of mice exposed to a conspecific intruder, it identified genes that were both differentially expressed and next to differentially accessible chromatin. The proposed method is available in the R package github.com/sdzhao/ssa

    On the Choice and Number of Microarrays for Transcriptional Regulatory Network Inference

    Background: Transcriptional regulatory network inference (TRNI) from large compendia of DNA microarrays has become a fundamental approach for discovering transcription factor (TF)-gene interactions at the genome-wide level. In correlation-based TRNI, network edges can in principle be evaluated using standard statistical tests. However, while such tests nominally assume independent microarray experiments, we expect dependency between the experiments in microarray compendia, due to both project-specific factors (e.g., microarray preparation, environmental effects) in the multi-project compendium setting and effective dependency induced by gene-gene correlations. Herein, we characterize the nature of dependency in an Escherichia coli microarray compendium and explore its consequences on the problem of determining which and how many arrays to use in correlation-based TRNI. Results: We present evidence of substantial effective dependency among microarrays in this compendium, and characterize that dependency with respect to experimental condition factors. We then introduce a measure n_eff of the effective number of experiments in a compendium, and find that corresponding to the dependency observed in this particular compendium there is a huge reduction in effective sample size, i.e., n_eff = 14.7 versus n = 376. Furthermore, we found that the n_eff of select subsets of experiments actually exceeded n_eff of the full compendium, suggesting that the adage 'less is more' applies here. Consistent with this latter result, we observed improved performance in TRNI using subsets of the data compared to results using the full compendium. We identified experimental condition factors that trend with changes in TRNI performance and n_eff, including growth phase and media type. Finally, using the set of known E. coli genetic regulatory interactions from RegulonDB, we demonstrated that false discovery rates (FDR) derived from n_eff-adjusted p-values were well-matched to FDR based on the RegulonDB truth set. Conclusions: These results support utilization of n_eff as a potent descriptor of microarray compendia. In addition, they highlight a straightforward correlation-based method for TRNI with demonstrated meaningful statistical testing for significant edges, readily applicable to compendia from any species, even when a truth set is not available. This work facilitates a more refined approach to construction and utilization of mRNA expression compendia in TRNI.

    Joint Multiple Testing Procedures for Graphical Model Selection with Applications to Biological Networks

    Gaussian graphical models have become popular tools for identifying relationships between genes when analyzing microarray expression data. In the classical undirected Gaussian graphical model setting, conditional independence relationships can be inferred from partial correlations obtained from the concentration matrix (= inverse covariance matrix) when the sample size n exceeds the number of parameters p which need to be estimated. In situations where n < p, another approach to graphical model estimation may rely on calculating unconditional (zero-order) and first-order partial correlations. In these settings, the goal is to identify a lower-order conditional independence graph, sometimes referred to as a '0-1 graph'. For either choice of graph, model selection may involve a multiple testing problem, in which edges in a graph are drawn only after rejecting hypotheses involving (saturated or lower-order) partial correlation parameters. Most multiple testing procedures applied in previously proposed graphical model selection algorithms rely on standard, marginal testing methods which do not take into account the joint distribution of the test statistics derived from (partial) correlations. We propose and implement a multiple testing framework useful when testing for edge inclusion during graphical model selection. Two features of our methodology include (i) a computationally efficient and asymptotically valid test statistics joint null distribution derived from influence curves for correlation-based parameters, and (ii) the application of empirical Bayes joint multiple testing procedures which can effectively control a variety of popular Type I error rates by incorporating joint null distributions such as those described here (Dudoit and van der Laan, 2008). Using a dataset from Arabidopsis thaliana, we observe that the use of more sophisticated, modular approaches to multiple testing allows one to identify greater numbers of edges when approximating an undirected graphical model using a 0-1 graph. Our framework may also be extended to edge testing algorithms for other types of graphical models (e.g., for classical undirected, bidirected, and directed acyclic graphs).
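
The zero-order and first-order partial correlations mentioned above can be computed without inverting a covariance matrix, which is what makes them usable when n < p. The sketch below uses the standard textbook formula for a first-order partial correlation. The helper `weakest_low_order_corr` is a hypothetical illustration of the 0-1 graph idea (an edge is kept only if the marginal correlation and every first-order partial correlation are non-zero); it is not the joint testing procedure proposed in the abstract, which replaces a simple magnitude check with formal multiple testing.

```python
import math


def pearson(x, y):
    """Zero-order (marginal) Pearson correlation of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_order_partial(r_ij, r_ik, r_jk):
    """Correlation of variables i and j after conditioning on one variable k."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def weakest_low_order_corr(R, i, j):
    """Smallest absolute zero- or first-order correlation for the pair (i, j).

    R is a full correlation matrix given as a list of lists. A value near
    zero suggests the pair would not be joined in a 0-1 graph.
    """
    values = [R[i][j]]
    for k in range(len(R)):
        if k not in (i, j):
            values.append(first_order_partial(R[i][j], R[i][k], R[j][k]))
    return min(abs(v) for v in values)


# Hypothetical correlation matrix: variables 0 and 1 are marginally
# correlated, but the association is fully explained by variable 2.
R = [
    [1.00, 0.25, 0.50],
    [0.25, 1.00, 0.50],
    [0.50, 0.50, 1.00],
]
print(first_order_partial(R[0][1], R[0][2], R[1][2]))
print(weakest_low_order_corr(R, 0, 1))
```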

    Analysis of Gene Sets Based on the Underlying Regulatory Network

    Networks are often used to represent the interactions among genes and proteins. These interactions are known to play an important role in vital cell functions and should be included in the analysis of genes that are differentially expressed. Methods of gene set analysis take advantage of external biological information and analyze a priori defined sets of genes. These methods can potentially preserve the correlation among genes; however, they do not directly incorporate the information about the gene network. In this paper, we propose a latent variable model that directly incorporates the network information. We then use the theory of mixed linear models to present a general inference framework for the problem of testing the significance of subnetworks. Several possible test procedures are introduced and a network-based method for testing the changes in expression levels of genes as well as the structure of the network is presented. The performance of the proposed method is compared with methods of gene set analysis using both simulation studies and real data on genes related to the galactose utilization pathway in yeast. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/78147/1/cmb.2008.0081.pd

    Estimation and Detection of Multivariate Gene Regulatory Relationships

    The Coefficient of Determination (CoD) plays an important role in Genomics problems, for instance, in the inference of gene regulatory networks from gene-expression data. However, the inference theory about CoD has not been investigated systematically. In this dissertation, we study the inference of discrete CoD from both frequentist and Bayesian perspectives, with its applications to system identification problems in Genomics. From a frequentist viewpoint, we provide a theoretical framework for CoD estimation by introducing nonparametric CoD estimators and parametric maximum-likelihood (ML) CoD estimators based on static and dynamical Boolean models. Inference algorithms are developed to discover gene regulatory relationships, and numerical examples are provided to validate preferable performance of the ML approach with access to sufficient prior knowledge. To make the applications of the CoD independent of user-selectable thresholds, we describe rigorous multiple testing procedures to investigate significant regulatory relationships among genes using the discrete CoD, and to discover canalyzing genes using the intrinsically multivariate prediction (IMP) criterion. We develop practical statistical tools that are open to the scientific community. On the other hand, we propose a Bayesian framework for the inference of the CoD across a parametrized family of joint distributions between target and predictors. Examples of applications of the Bayesian approach are provided against those of nonparametric and parametric approaches by using synthetic data. We have found that, with applications to system identification problems in Genomics, both parametric and Bayesian CoD estimation approaches outperform the nonparametric approaches. Hence, we conclude that parametric and Bayesian estimation approaches are preferred when we have partial knowledge about gene regulation. On the other hand, we have shown that the two proposed statistical testing frameworks can detect well-known gene regulation and canalyzing genes like p53 and DUSP1 from real data sets, respectively. This indicates that our methodology could serve as a promising tool for the detection of potential gene regulatory relationships and canalyzing genes. In short, this dissertation is intended to serve as a foundation for a detailed study of applications of CoD estimation in Genomics and related fields.
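
To make the discrete CoD concrete, here is a minimal sketch of a nonparametric plug-in (resubstitution) estimate for a binary target, assuming the usual definition CoD = (e0 - e) / e0, where e0 is the error of the best constant predictor of the target and e is the error of the best predictor that uses the observed predictor values. The function name and toy data are hypothetical, raising an error when e0 = 0 is a choice made here to avoid picking a convention, and this is not claimed to be the dissertation's exact estimator.

```python
from collections import Counter


def cod_plugin(xs, ys):
    """Plug-in CoD estimate for a binary target.

    xs : list of hashable predictor values (e.g. tuples of 0/1 gene states)
    ys : list of 0/1 target values, same length as xs
    """
    n = len(ys)
    y_counts = Counter(ys)
    # Error of the best constant guess: frequency of the minority class.
    eps0 = min(y_counts.get(0, 0), y_counts.get(1, 0)) / n
    if eps0 == 0:
        raise ValueError("target is constant; CoD is undefined in this sketch")
    joint = Counter(zip(xs, ys))
    # Error of the best predictor given x: minority count within each x cell.
    eps = sum(min(joint.get((x, 0), 0), joint.get((x, 1), 0)) for x in set(xs)) / n
    return (eps0 - eps) / eps0


# Toy example: the target is the XOR of two binary predictors.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2
target = [0, 1, 1, 0] * 2
first_only = [(a,) for a, _ in pairs]

print(cod_plugin(pairs, target))       # both predictors together
print(cod_plugin(first_only, target))  # first predictor alone
```

In the XOR example the two predictors determine the target jointly while neither does so alone, which is the kind of pattern the intrinsically multivariate prediction (IMP) criterion is meant to flag.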

    Inferring and perturbing cell fate regulomes in human brain organoids

    Self-organizing neural organoids grown from pluripotent stem cells (refs 1-3) combined with single-cell genomic technologies provide opportunities to examine gene regulatory networks underlying human brain development. Here we acquire single-cell transcriptome and accessible chromatin data over a dense time course in human organoids covering neuroepithelial formation, patterning, brain regionalization and neurogenesis, and identify temporally dynamic and brain-region-specific regulatory regions. We developed Pando, a flexible framework that incorporates multi-omic data and predictions of transcription-factor-binding sites to infer a global gene regulatory network describing organoid development. We use pooled genetic perturbation with single-cell transcriptome readout to assess transcription factor requirement for cell fate and state regulation in organoids. We find that certain factors regulate the abundance of cell fates, whereas other factors affect neuronal cell states after differentiation. We show that the transcription factor GLI3 is required for cortical fate establishment in humans, recapitulating previous research performed in mammalian model systems. We measure transcriptome and chromatin accessibility in normal or GLI3-perturbed cells and identify two distinct GLI3 regulomes that are central to telencephalic fate decisions: one regulating dorsoventral patterning with HES4/5 as direct GLI3 targets, and one controlling ganglionic eminence diversification later in development. Together, we provide a framework for how human model systems and single-cell technologies can be leveraged to reconstruct human developmental biology.