16 research outputs found

    Discriminative Topological Features Reveal Biological Network Mechanisms

    Recent genomic and bioinformatic advances have motivated the development of numerous random network models purporting to describe graphs of biological, technological, and sociological origin. The success of a model has been evaluated by how well it reproduces a few key features of the real-world data, such as degree distributions, mean geodesic lengths, and clustering coefficients. Often pairs of models can reproduce these features with indistinguishable fidelity despite being generated by vastly different mechanisms. In such cases, these few target features are insufficient to distinguish which of the different models best describes real-world networks of interest; moreover, it is not clear a priori that any of the presently existing algorithms for network generation offers a predictive description of the networks inspiring them. To derive discriminative classifiers, we construct a mapping from the set of all graphs to a high-dimensional (in principle infinite-dimensional) "word space." This map defines an input space for classification schemes which allow us for the first time to state unambiguously which models are most descriptive of the networks they purport to describe. Our training sets include networks generated from 17 models either drawn from the literature or introduced in this work, source code for which is freely available. We anticipate that this new approach to network analysis will be of broad impact to a number of communities. Comment: supplemental website: http://www.columbia.edu/itc/applied/wiggins/netclass

    Geoseq: a tool for dissecting deep-sequencing datasets

    Gurtowski J, Cancio A, Shah H, et al. Geoseq: a tool for dissecting deep-sequencing datasets. BMC Bioinformatics. 2010;11(1):506.
    Background: Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), the Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest.
    Results: Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep-sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue, and type of experiment.
    Conclusions: Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
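The tile-coverage idea in the Geoseq abstract (cut the query sequence into tiles, then count how often each tile occurs in a read library via a suffix array) can be illustrated with a toy sketch. This is not Geoseq's implementation: the function names, the `$` read separator, the tile length and step, and the example reads are all assumptions made for illustration, and the suffix array is built naively.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; adequate only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    # Two binary searches find the block of suffixes that start with pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def tile_coverage(reference, reads, tile_len, step):
    # Join reads with '$' (assumed absent from sequences) so that no
    # match can span two reads.
    text = "$".join(reads) + "$"
    sa = build_suffix_array(text)
    coverage = []
    for i in range(0, len(reference) - tile_len + 1, step):
        tile = reference[i:i + tile_len]
        coverage.append((i, count_occurrences(text, sa, tile)))
    return coverage


# Hypothetical toy data, not taken from any real library.
reads = ["ACGTAC", "GTACGT", "TTTT"]
coverage = tile_coverage("ACGTACGG", reads, tile_len=4, step=2)
print(coverage)
```

Each tuple pairs a tile's start offset in the query with the number of exact occurrences of that tile in the reads. A production tool would use a linear-time suffix array construction and precomputed indexes rather than sorting suffixes on the fly.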

    Dark-matter matters: Discriminating subtle blood cancers using the darkest DNA.

    The confluence of deep sequencing and powerful machine learning is providing an unprecedented peek at the darkest of the dark genomic matter, the non-coding genomic regions lacking any functional annotation. While deep sequencing uncovers rare tumor variants, the heterogeneity of the disease confounds the best of machine learning (ML) algorithms. Here we set out to answer whether the dark matter of the genome encompasses signals that can distinguish the fine subtypes of disease that are otherwise genomically indistinguishable. We introduce a novel stochastic regularization, ReVeaL, that empowers ML to discriminate subtle cancer subtypes even from the same 'cell of origin'. Analogous to heritability, which is implicitly defined on the whole genome, we use predictability (F1 score), which can be defined on portions of the genome. In an effort to distinguish cancer subtypes using dark-matter DNA, we applied ReVeaL to a new WGS dataset from 727 patient samples with seven forms of hematological cancers and assessed the predictivity over several genomic regions including genic, non-dark, non-coding, non-genic, and dark. ReVeaL enabled improved discrimination of cancer subtypes for all segments of the genome. The non-genic, non-coding, and dark-matter regions had the highest F1 scores, with dark matter having the highest level of predictability. Based on ReVeaL's predictability of different genomic regions, dark matter contains enough signal to significantly discriminate fine subtypes of disease. Hence, the agglomeration of rare variants, even in the hitherto unannotated and ill-understood regions of the genome, may play a substantial role in the disease etiology and deserve much more attention.

    Discriminative topological features reveal biological network mechanisms-0

    Copyright information: Taken from "Discriminative topological features reveal biological network mechanisms". BMC Bioinformatics 2004;5:181. Published online 22 Nov 2004. PMCID: PMC535926. Copyright © 2004 Middendorf et al; licensee BioMed Central Ltd.
    and the Grindrod [17] model. is robustly classified as a Middendorf-Ziv network. The Grindrod model is the runner-up. We here show data for a word that especially the Middendorf-Ziv model over the Grindrod model. The histograms of the word over the training data are shown along with their associated densities calculated from the data by Gaussian kernel density estimation. The densities give the following log--values at the word value for the network: log() = -376, log() = -6.23
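The caption above compares log-density values obtained by Gaussian kernel density estimation for two models at one observed word value. A minimal sketch of that kind of comparison follows. It is not the authors' code: the function name, the use of Scott's rule for the bandwidth, and the training values are all assumptions, and the numbers are invented for illustration.

```python
import math
import statistics


def log_gaussian_kde(samples, x, bandwidth=None):
    """Log of a 1-D Gaussian kernel density estimate evaluated at x."""
    n = len(samples)
    if bandwidth is None:
        # Scott's rule for 1-D data (an assumed choice): h = sigma * n^(-1/5)
        bandwidth = statistics.stdev(samples) * n ** (-1 / 5)
    # Log of each unnormalized kernel term.
    log_terms = [-0.5 * ((x - s) / bandwidth) ** 2 for s in samples]
    # Log-sum-exp keeps very small densities from underflowing to zero.
    peak = max(log_terms)
    log_sum = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_sum - math.log(n * bandwidth * math.sqrt(2 * math.pi))


# Hypothetical word values over training networks from two models.
model_a_words = [0.8, 0.9, 1.0, 1.1, 1.2]
model_b_words = [4.8, 4.9, 5.0, 5.1, 5.2]
observed = 1.05  # hypothetical word value for the network being classified

log_p_a = log_gaussian_kde(model_a_words, observed)
log_p_b = log_gaussian_kde(model_b_words, observed)
print(log_p_a, log_p_b)
```

Working in log space matters here because densities far in the tail of one model, such as the strongly negative log values quoted in the caption, would otherwise be hard to compare reliably.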

    Discriminative topological features reveal biological network mechanisms-1

    Copyright information: Taken from "Discriminative topological features reveal biological network mechanisms". BMC Bioinformatics 2004;5:181. Published online 22 Nov 2004. PMCID: PMC535926. Copyright © 2004 Middendorf et al; licensee BioMed Central Ltd.
    Krapivsky-Bianconi [18, 14] model. is robustly classified as a Kumar network. The Krapivsky-Bianconi model is the runner-up. We here show data for a word that especially the Kumar model over the Krapivsky-Bianconi model. The histograms of the word over the training data are shown along with their associated densities calculated from the data by Gaussian kernel density estimation. The densities give the following log--values at the word value for the network: log() = -4.22, log() = -12.0