Discriminative Topological Features Reveal Biological Network Mechanisms
Recent genomic and bioinformatic advances have motivated the development of
numerous random network models purporting to describe graphs of biological,
technological, and sociological origin. The success of a model has been
evaluated by how well it reproduces a few key features of the real-world data,
such as degree distributions, mean geodesic lengths, and clustering
coefficients. Often pairs of models can reproduce these features with
indistinguishable fidelity despite being generated by vastly different
mechanisms. In such cases, these few target features are insufficient to
distinguish which of the different models best describes real-world networks of
interest; moreover, it is not clear a priori that any of the presently existing
algorithms for network generation offers a predictive description of the
networks inspiring them. To derive discriminative classifiers, we construct a
mapping from the set of all graphs to a high-dimensional (in principle
infinite-dimensional) "word space." This map defines an input space for
classification schemes which allow us for the first time to state unambiguously
which models are most descriptive of the networks they purport to describe. Our
training sets include networks generated from 17 models either drawn from the
literature or introduced in this work, source code for which is freely
available. We anticipate that this new approach to network analysis will be of
broad impact to a number of communities.
Comment: supplemental website:
http://www.columbia.edu/itc/applied/wiggins/netclass
Geoseq: a tool for dissecting deep-sequencing datasets
Gurtowski J, Cancio A, Shah H, et al. Geoseq: a tool for dissecting deep-sequencing datasets. BMC Bioinformatics. 2010;11(1):506.
Background
Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA), both hosted by the NCBI, and the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they remain underused because datasets of interest are difficult to locate and analyze.
Results
Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep-sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library using suffix arrays. The user can upload custom target sequences or search by gene/miRNA name, and results are returned as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment.
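The tiling-and-coverage step described above can be illustrated with a minimal Python sketch. It is not Geoseq's implementation: the function names, the tile length, the step size, the `$` read separator and the naive suffix-array construction are all assumptions chosen to keep the example short.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction (sorting whole suffixes); adequate for a toy
    example only, not for real sequencing libraries.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, suffix_array, pattern, upper):
    """Binary search over the suffix array.

    Returns the first index whose suffix prefix is >= pattern (upper=False)
    or > pattern (upper=True), comparing only the first len(pattern) chars.
    """
    lo, hi = 0, len(suffix_array)
    k = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[suffix_array[mid]:suffix_array[mid] + k]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text, suffix_array, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    first = _bound(text, suffix_array, pattern, upper=False)
    last = _bound(text, suffix_array, pattern, upper=True)
    return last - first


def tile_coverage(reference, reads, tile_len=4, step=1):
    """Split reference into tiles and count each tile in the read library.

    Reads are joined with '$' (assumed absent from the sequence alphabet)
    so that a tile can never match across a read boundary.
    """
    library = "$".join(reads)
    suffix_array = build_suffix_array(library)
    coverage = []
    for start in range(0, len(reference) - tile_len + 1, step):
        tile = reference[start:start + tile_len]
        coverage.append((start, tile, count_occurrences(library, suffix_array, tile)))
    return coverage


if __name__ == "__main__":
    # Hypothetical toy reads and reference sequence, for illustration only.
    toy_reads = ["ACGTAC", "GTACGT", "TTTT"]
    for start, tile, count in tile_coverage("ACGTA", toy_reads, tile_len=4):
        print(start, tile, count)
```

The index is built once over the read library and then queried for every tile, which is why a small reference sequence can be checked cheaply against a large, pre-indexed dataset.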
Conclusions
Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
Dark-matter matters: Discriminating subtle blood cancers using the darkest DNA.
The confluence of deep sequencing and powerful machine learning (ML) is providing an unprecedented peek at the darkest of the dark genomic matter, the non-coding genomic regions lacking any functional annotation. While deep sequencing uncovers rare tumor variants, the heterogeneity of the disease confounds the best of ML algorithms. Here we set out to determine whether the dark matter of the genome encompasses signals that can distinguish the fine subtypes of disease that are otherwise genomically indistinguishable. We introduce a novel stochastic regularization, ReVeaL, that empowers ML to discriminate subtle cancer subtypes even from the same 'cell of origin'. Analogous to heritability, which is implicitly defined on the whole genome, we use predictability (F1 score), which can be defined on portions of the genome. In an effort to distinguish cancer subtypes using dark-matter DNA, we applied ReVeaL to a new WGS dataset from 727 patient samples with seven forms of hematological cancers and assessed the predictivity over several genomic regions including genic, non-dark, non-coding, non-genic, and dark. ReVeaL enabled improved discrimination of cancer subtypes for all segments of the genome. The non-genic, non-coding and dark-matter regions had the highest F1 scores, with dark-matter having the highest level of predictability. Based on ReVeaL's predictability of different genomic regions, dark-matter contains enough signal to significantly discriminate fine subtypes of disease. Hence, the agglomeration of rare variants, even in the hitherto unannotated and ill-understood regions of the genome, may play a substantial role in the disease etiology and deserves much more attention.
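For readers unfamiliar with the F1 score used above as a measure of predictability, the following Python sketch computes it for a multi-class problem. It does not reproduce ReVeaL or the authors' evaluation: the abstract does not say how F1 is averaged across subtypes, so macro-averaging is an assumption here, and the subtype labels are hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each class label to its F1 score.

    F1 = 2*TP / (2*TP + FP + FN); a class with no true or predicted
    members is given a score of 0.0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Hypothetical subtype labels for four samples, for illustration only.
    true_subtypes = ["A", "A", "B", "B"]
    predicted_subtypes = ["A", "B", "B", "B"]
    print(f1_per_class(true_subtypes, predicted_subtypes))
    print(macro_f1(true_subtypes, predicted_subtypes))
```

Because the score depends only on the predictions made from a given feature set, it can be computed separately for classifiers trained on each genomic region and then compared across regions.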
Discriminative topological features reveal biological network mechanisms-0
Copyright information: Taken from "Discriminative topological features reveal biological network mechanisms". BMC Bioinformatics 2004;5:181. Published online 22 Nov 2004. PMCID: PMC535926. Copyright © 2004 Middendorf et al; licensee BioMed Central Ltd.
and the Grindrod [17] model. is robustly classified as a Middendorf-Ziv network. The Grindrod model is the runner-up. We here show data for a word that especially the Middendorf-Ziv model over the Grindrod model. The histograms of the word over the training data are shown along with their associated densities calculated from the data by Gaussian kernel density estimation. The densities give the following log values at the word value for the network: log() = -376, log() = -6.23
Discriminative topological features reveal biological network mechanisms-1
Krapivsky-Bianconi [18, 14] model. is robustly classified as a Kumar network. The Krapivsky-Bianconi model is the runner-up. We here show data for a word that especially the Kumar model over the Krapivsky-Bianconi model. The histograms of the word over the training data are shown along with their associated densities calculated from the data by Gaussian kernel density estimation. The densities give the following log values at the word value for the network: log() = -4.22, log() = -12.0
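The comparison described in these captions, evaluating a kernel density estimate for each candidate model at the word value observed for a real network, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the bandwidth is a fixed, hand-picked parameter (the captions do not say how it was chosen), and the sample values are hypothetical.

```python
import math


def gaussian_kde_log_density(samples, x, bandwidth):
    """Return the log of a Gaussian kernel density estimate at x.

    The density is (1 / (n*h)) * sum_i N((x - s_i) / h), with N the
    standard normal pdf. The sum is evaluated with the log-sum-exp
    trick so the result stays finite far out in the tails.
    """
    n = len(samples)
    log_kernels = [-0.5 * ((x - s) / bandwidth) ** 2 for s in samples]
    largest = max(log_kernels)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in log_kernels))
    return log_sum - math.log(n * bandwidth * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    # Hypothetical word values over the training networks of two models.
    model_a_values = [0.9, 1.0, 1.1, 1.3]
    model_b_values = [4.8, 5.0, 5.1, 5.2]
    observed = 1.2  # hypothetical word value for the network being classified

    log_p_a = gaussian_kde_log_density(model_a_values, observed, bandwidth=0.3)
    log_p_b = gaussian_kde_log_density(model_b_values, observed, bandwidth=0.3)
    print(log_p_a, log_p_b)
```

The model with the larger log density at the observed value is the one this word favors. Working in log space matters here because, as the values quoted in the captions show, the densities being compared can differ by many orders of magnitude.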