11 research outputs found

    Wide-Open: Accelerating public data release by automating detection of overdue datasets

    No full text
    <div><p>Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parsing query results from repositories to determine whether the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.</p></div>
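    The detection step described above can be illustrated with a minimal sketch. It assumes that GEO series are cited as `GSE` followed by digits and SRA studies as `SRP` followed by digits; the abstract does not state the patterns Wide-Open actually uses. The `is_private` callback is a hypothetical stand-in for the repository query, stubbed here with a dictionary so the example makes no network calls.

```python
import re

# Assumed accession patterns: GEO series (GSE + digits), SRA studies (SRP + digits).
ACCESSION_RE = re.compile(r"\b(?:GSE|SRP)\d+\b")


def find_accessions(text):
    """Return the sorted, de-duplicated accession IDs mentioned in article text."""
    return sorted(set(ACCESSION_RE.findall(text)))


def overdue(accessions, is_private):
    """Keep accessions that the (hypothetical) repository lookup reports as private."""
    return [acc for acc in accessions if is_private(acc)]


# Toy article text and a stubbed repository status table (both invented for illustration).
article = (
    "Data were deposited in GEO under GSE12345 and in SRA under SRP067890 "
    "(see also GSE12345)."
)
stub_status = {"GSE12345": True, "SRP067890": False}

found = find_accessions(article)
flagged = overdue(found, lambda acc: stub_status[acc])
print(found, flagged)
```

    In a real pipeline the callback would query the repository and parse the response; that part is left out because the listing gives no details of it.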

    Number of samples in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO).

    No full text
    <p>Data underlying the figure are available as <a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.2002477#pbio.2002477.s001" target="_blank">S1 Data</a>.</p>

    Number of Gene Expression Omnibus (GEO) datasets overdue for release over time, as detected by Wide-Open.

    No full text
    <p>Prior to this submission, we notified GEO of the standing list, which led to a dramatic drop in overdue datasets (magenta portion), with 400 datasets released within the first week. Data underlying the figure are available as <a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.2002477#pbio.2002477.s002" target="_blank">S2 Data</a>.</p>

    Average delay from submission to release in the Gene Expression Omnibus (GEO).

    No full text
    <p>Data underlying the figure are available as <a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.2002477#pbio.2002477.s003" target="_blank">S3 Data</a>.</p>

    The Kaplan-Meier plot showing differences in the survival rate measured in AML3 (A and B) and BRC3 (C and D) between the two patient groups with equal size, created based on the predicted survival time from each prediction model.

    No full text
    <p>We consider the model trained on the top <i>N</i> (<i>N</i> = 1,351 for AML; <i>N</i> = 2,137 for BRC) DISCERN-scoring genes and clinical covariates (blue), and the model trained on clinical covariates only (red) (panels A and C for AML3 and BRC3, respectively). (B) The panel shows the comparison with the model trained using the 22 genes of a previously known prognostic marker, called LSC [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004888#pcbi.1004888.ref054" target="_blank">54</a>], along with the clinical covariates (red). (D) The panel shows the comparison with the model trained using 67 genes from the MammaPrint prognostic marker (70 genes) [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004888#pcbi.1004888.ref083" target="_blank">83</a>] along with the clinical covariates. We used the 67 of the 70 genes that are present in our BRC expression datasets. P-values shown in each plot are based on the logrank test (red).</p>
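    The grouping and survival curves in this caption can be illustrated with a minimal sketch. It assumes the two equal-size groups come from a median split on predicted survival time, which the caption does not spell out, and it uses the standard Kaplan-Meier product-limit estimator. All patient data below are invented for illustration, and the logrank test is not shown.

```python
def split_by_prediction(predicted):
    """Split patient indices into two equal halves by predicted survival time.

    Assumption: a simple median split; returns (shorter, longer) index lists.
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    half = len(order) // 2
    return order[:half], order[half:]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each observed event time.

    events[i] is 1 if patient i died at times[i], 0 if censored at times[i].
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy data (invented): predicted survival times and observed follow-up.
short_group, long_group = split_by_prediction([2.0, 9.0, 1.0, 8.0])
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(short_group, long_group, curve)
```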

    Identifying Network Perturbation in Cancer

    No full text
    <div><p>We present a computational framework, called DISCERN (<b>DI</b>fferential <b>S</b>pars<b>E</b> <b>R</b>egulatory <b>N</b>etwork), to identify informative topological changes in gene-regulator dependence networks inferred on the basis of mRNA expression datasets within distinct biological states. DISCERN takes two expression datasets as input: an expression dataset of diseased tissues from patients with a disease of interest and another expression dataset from matching normal tissues. DISCERN estimates the extent to which each gene is <i>perturbed</i>, that is, has distinct regulator connectivity in the inferred gene-regulator dependencies between the disease and normal conditions. This approach has distinct advantages over existing methods. First, DISCERN infers <i>conditional dependencies</i> between candidate regulators and genes, where conditional dependence relationships discriminate the evidence for direct interactions from indirect interactions more precisely than pairwise correlation. Second, DISCERN uses a new likelihood-based scoring function to alleviate concerns about accuracy of the specific edges inferred in a particular network. On synthetic data, DISCERN identifies perturbed genes more accurately than existing methods for identifying perturbed genes between distinct states. In expression datasets from patients with acute myeloid leukemia (AML), breast cancer and lung cancer, genes with high DISCERN scores in each cancer are enriched for known tumor drivers, genes associated with the biological processes known to be important in the disease, and genes associated with patient prognosis, in the respective cancer. Finally, we show that DISCERN can uncover potential mechanisms underlying network perturbation by explaining observed epigenomic activity patterns in cancer and normal tissue types more accurately than alternative methods, based on the available epigenomic data from the ENCODE project.</p></div>

    Identifying Network Perturbation in Cancer - Fig 2

    No full text
    <p><b>(A) Average receiver operating characteristic (ROC) curves from the experiments on synthetic data.</b> We compare DISCERN with 7 alternative methods: 3 existing methods (LNS [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004888#pcbi.1004888.ref035" target="_blank">35</a>], D-score [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004888#pcbi.1004888.ref036" target="_blank">36</a>], and PLSNet [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004888#pcbi.1004888.ref034" target="_blank">34</a>]) and 4 methods we developed for comparison (pLNS, pD-score, <i>D</i><sup>0</sup> and p<i>D</i><sup>0</sup>). (B) Comparison of the runtime (hours) between PLSNet and DISCERN for varying numbers of variables (<i>p</i>). The triangles mark the run times measured at specific values of <i>p</i>, and lines connect these measurements. PLSNet uses the empirical p-values from permutation tests as scores, and DISCERN does not. For large values of <i>p</i>, DISCERN is two to three orders of magnitude faster than PLSNet.</p>

    Identifying Network Perturbation in Cancer - Fig 1

    No full text
    <p><b>(A) A simple hypothetical example that illustrates the perturbation of a network of 7 genes between disease and normal tissues.</b> One possible cause of the perturbation is a cancer driver mutation on gene ‘1’ that alters the interactions between gene ‘1’ and genes ‘3’, ‘4’, ‘5’, and ‘6’. (B) One possible cause of network perturbation. Gene ‘1’ is regulated by different sets of genes between cancer and normal conditions. (C) The overview of our approach. DISCERN takes two expression datasets as input: an expression dataset from patients with a disease of interest and another expression dataset from normal tissues (top). DISCERN computes the network perturbation score for each gene that estimates the difference in connection between the gene and other genes between disease and normal conditions (middle). We perform various post-analyses to evaluate the DISCERN method by comparing with alternative methods, based on the importance of the high-scoring genes in the disease through a survival analysis and on how well the identified perturbed genes explain the observed epigenomic activity data (bottom).</p>

    The significance of the enrichment for survival-associated genes in the identified perturbed genes.

    No full text
    <p>We compared DISCERN with LNS and D-score based on the Fisher’s exact test p-value that measures the significance of the overlap between the <i>N</i> top-scoring genes and survival-associated genes in each of three cancers. (A)-(C) We plotted −log<sub>10</sub>(p-value) from the Fisher’s exact test when the <i>N</i> top-scoring genes were considered by each method in 3 datasets: (A) AML (<i>N</i> = 1,351), (B) BRC (<i>N</i> = 2,137), and (C) LUAD (<i>N</i> = 3,836). For ANOVA, we considered 8,993 genes (AML), 7,922 genes (BRC) and 13,344 genes (LUAD) that show significant differential expression at FDR-corrected p-value &lt; 0.05. (D)-(F) We consider up to 1,500 (AML), 2,500 (BRC), and 4,000 (LUAD) top-scoring genes in each method, to show that DISCERN outperforms LNS and D-score over a range of <i>N</i> values. The red dotted line indicates the 1,351 genes (AML), 2,137 genes (BRC), and 3,836 genes (LUAD) that are identified as significantly perturbed by DISCERN (FDR &lt; 0.05). We compare 4 methods in 3 cancer types: 3 methods that identify network-perturbed genes (solid lines) and ANOVA, which identifies differentially expressed genes (dotted line).</p>
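    The overlap significance used in this figure can be illustrated with a minimal sketch. It assumes a one-sided Fisher's exact test, computed as the upper tail of the hypergeometric distribution; the caption does not say which alternative was used. The function name `overlap_pvalue` and all counts below are invented for illustration and are not taken from the paper.

```python
from math import comb, log10


def overlap_pvalue(total, n_top, n_assoc, n_overlap):
    """One-sided Fisher's exact p-value for an overlap of at least n_overlap.

    total     : number of genes in the universe
    n_top     : number of top-scoring genes selected by a method
    n_assoc   : number of survival-associated genes
    n_overlap : observed number of genes in both sets
    """
    upper = min(n_top, n_assoc)
    # Hypergeometric upper tail, kept as exact integers until the final division.
    tail = sum(
        comb(n_assoc, k) * comb(total - n_assoc, n_top - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(total, n_top)


# Toy counts (invented): 20 genes, 5 top-scoring, 5 survival-associated, 3 shared.
p = overlap_pvalue(total=20, n_top=5, n_assoc=5, n_overlap=3)
score = -log10(p)  # the quantity plotted on the y-axis in panels (A)-(F)
print(p, score)
```

    For gene counts in the thousands the tail probability can underflow ordinary floating point, so a log-space implementation from a statistics library would be the safer choice in practice.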