
    Coverage statistics for sequence census methods

    Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion of the shape of a coverage function, which can be used to detect aberrations in coverage. The probability theory underlying these problems is essential for constructing models of current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: We show that regardless of fragment length distribution and under the mild assumption that fragment start sites are Poisson distributed, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the jump skeleton of the coverage function, and show that the induced trees are Galton-Watson trees whose parameters can be computed. Conclusions: Our results extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence-census-based experiments. By focusing on fragments, we are also led to a new approach for visualizing sequencing data that should be of independent interest. Comment: 10 pages, 4 figures.
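
    The two model ingredients above — Poisson-distributed fragment start sites and an arbitrary fragment length distribution — are easy to simulate. The sketch below is illustrative only (not the authors' code; `genome_len`, `rate`, and `length_sampler` are invented parameters): start sites are drawn as a Poisson process via exponential inter-arrival times, and per-base coverage is accumulated.

```python
import random

def simulate_coverage(genome_len, rate, length_sampler, seed=0):
    """Per-base coverage when fragment start sites follow a Poisson
    process with the given rate per base and fragment lengths are drawn
    from an arbitrary caller-supplied distribution."""
    rng = random.Random(seed)
    coverage = [0] * genome_len
    pos = 0.0
    while True:
        # Poisson process: exponential inter-arrival times between starts.
        pos += rng.expovariate(rate)
        start = int(pos)
        if start >= genome_len:
            break
        end = min(start + length_sampler(rng), genome_len)
        for i in range(start, end):
            coverage[i] += 1
    return coverage

# Illustrative run: ~0.05 starts per base, lengths uniform in 150..249,
# so expected mean coverage is roughly 0.05 * 200 = 10.
cov = simulate_coverage(10_000, rate=0.05,
                        length_sampler=lambda r: 150 + r.randrange(100))
mean_cov = sum(cov) / len(cov)
```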

    Pseudoalignment for metagenomic read assignment

    Motivation: Read assignment is an important first step in many metagenomic analysis workflows, providing the basis for identification and quantification of species. However, ambiguity among the sequences of many strains makes it difficult to assign reads at the lowest level of taxonomy, and reads are typically assigned to taxonomic levels where they are unambiguous. We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data in order to develop novel methods for rapid and accurate quantification of metagenomic strains. Results: We find that the recent idea of pseudoalignment introduced in the RNA-Seq context is highly applicable in the metagenomics setting. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state-of-the-art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics projects.
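
    The coupling of compatibility sets with EM can be sketched in a few lines (hypothetical code, not the published software; `em_abundances` and its inputs are invented names): each ambiguous read is fractionally assigned to the references it is compatible with, in proportion to the current abundance estimates.

```python
def em_abundances(compat, n_refs, n_iter=100):
    """EM estimate of reference abundances from pseudoalignment-style
    compatibility sets.  Assumes every read is compatible with at least
    one reference."""
    theta = [1.0 / n_refs] * n_refs
    for _ in range(n_iter):
        counts = [0.0] * n_refs
        for refs in compat:
            total = sum(theta[r] for r in refs)
            for r in refs:                         # E-step: soft assignment
                counts[r] += theta[r] / total
        theta = [c / len(compat) for c in counts]  # M-step: renormalize
    return theta

# Three references; reads 0-3 are unambiguous, read 4 is ambiguous.
reads = [{0}, {0}, {1}, {2}, {0, 1}]
theta = em_abundances(reads, n_refs=3)
```

    EM resolves the ambiguous read mostly toward reference 0, which already has the most unique evidence.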

    A Study of Water Clusters Using the Effective Fragment Potential and Monte Carlo Simulated Annealing

    Simulated annealing methods have been used with the effective fragment potential to locate the lowest energy structures for the water clusters (H2O)n with n=6, 8, 10, 12, 14, 16, 18, and 20. The most successful method uses a local minimization on each Monte Carlo step. The effective fragment potential method yielded interaction energies in excellent agreement with those calculated at the ab initio Hartree–Fock level and was quite successful at predicting the same energy ordering as the higher-level perturbation theory and coupled cluster methods. Analysis of the molecular interaction energies in terms of their electrostatic, polarization, and exchange-repulsion/charge-transfer components reveals that the electrostatic contribution is the dominant term in determining the energy ordering of the minima on the (H2O)n potential energy surfaces, but that differences in the polarization and repulsion components can be important in some cases.
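
    The "local minimization on each Monte Carlo step" strategy can be sketched on a toy one-dimensional double-well function (illustrative only — the real method optimizes cluster geometries under the effective fragment potential; all names and parameters here are invented):

```python
import math
import random

def local_minimize(f, x, step=0.01, iters=400):
    """Crude derivative-free descent, standing in for a real optimizer."""
    for _ in range(iters):
        for dx in (step, -step):
            if f(x + dx) < f(x):
                x += dx
                break
    return x

def anneal(f, x0, t0=2.0, cooling=0.95, steps=300, seed=1):
    """Metropolis simulated annealing in which every trial move is
    followed by a local minimization, mirroring the 'minimize on each
    Monte Carlo step' strategy."""
    rng = random.Random(seed)
    x, t, best = x0, t0, x0
    for _ in range(steps):
        trial = local_minimize(f, x + rng.uniform(-1.0, 1.0))
        if f(trial) < f(x) or rng.random() < math.exp((f(x) - f(trial)) / t):
            x = trial          # accept downhill moves, and uphill moves
        if f(x) < f(best):     # with temperature-dependent probability
            best = x
        t *= cooling
    return best

# Toy double-well "energy" standing in for a cluster potential surface.
energy = lambda x: (x * x - 1.0) ** 2 + 0.3 * x
best = anneal(energy, x0=2.0)
```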

    NNT pseudoexon activation as a novel mechanism for disease in two siblings with familial glucocorticoid deficiency

    CONTEXT: Intronic DNA frequently encodes potential exonic sequences called pseudoexons. In recent years, mutations resulting in aberrant pseudoexon inclusion have been increasingly recognized to cause disease. OBJECTIVES: To find the genetic cause of familial glucocorticoid deficiency (FGD) in two siblings. PATIENTS: The proband and his affected sibling, from nonconsanguineous parents of East Asian and South African origin, were diagnosed with FGD at the ages of 21 and 8 months, respectively. DESIGN: Whole exome sequencing was performed on genomic DNA (gDNA) of the siblings. Variants in genes known to cause FGD were assessed for causality. Further analysis of gDNA and cDNA was performed by PCR/RT-PCR followed by automated Sanger sequencing. RESULTS: Whole exome sequencing identified a single, novel heterozygous variant (p.Arg71*) in nicotinamide nucleotide transhydrogenase (NNT) in both affected individuals. Follow-up cDNA analysis in the proband identified a 69-bp pseudoexon inclusion event, and Sanger sequencing of his gDNA identified a 4-bp duplication responsible for its activation. The variants segregated with the disease: p.Arg71* was inherited from the mother, the pseudoexon change was inherited from the father, and an unaffected sibling had inherited only the p.Arg71* variant. CONCLUSIONS: FGD in these siblings is caused by compound heterozygous mutations in NNT; one causing pseudoexon inclusion in combination with another leading to Arg71*. Discovery of this pseudoexon activation mutation highlights the importance of identifying sequence changes in introns by cDNA analysis. The clinical implications of these findings include: facilitation of antenatal genetic diagnosis, early institution of potentially lifesaving therapy, and the possibility of preventative or curative intervention.

    The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007

    In special coordinates (codon position-specific nucleotide frequencies), bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All the 348 distinct bacterial genomes available in GenBank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain the observed high accuracy. The new phenomenon of complementary symmetry for codon position-specific nucleotide frequencies is observed. The results of analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, which is also known as the context-free or complete independence model, or the Segre variety, can serve as a reasonable approximation to the real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and the optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The first three eigenvalues in the codon usage PCA explain 59.1%, 7.8%, and 4.7% of the variation. The codon usage of eubacterial and archaeal genomes is clearly distributed along two third-order curves with genomic G+C content as a parameter. Comment: significantly extended version with new data for all the 348 distinct bacterial genomes available in GenBank in April 2007.
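
    The coordinates in question are computed directly from coding sequences; the sketch below (illustrative, with an invented function name) tallies nucleotide frequencies at the three codon positions:

```python
from collections import Counter

def codon_position_freqs(cds):
    """Frequencies of A, C, G, T at each of the three codon positions.
    Since the four frequencies at each position sum to 1, dropping one
    base per position leaves 9 independent coordinates — the space in
    which the two straight lines are observed."""
    counts = [Counter(), Counter(), Counter()]
    n = len(cds) // 3
    for i in range(0, 3 * n, 3):        # walk the sequence codon by codon
        for pos in range(3):
            counts[pos][cds[i + pos]] += 1
    return [[counts[pos][b] / n for b in "ACGT"] for pos in range(3)]

# Tiny illustrative coding sequence (three codons: ATG GCG TAA).
freqs = codon_position_freqs("ATGGCGTAA")
```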

    Markov basis and Groebner basis of Segre-Veronese configuration for testing independence in group-wise selections

    We consider testing independence in group-wise selections with some restrictions on combinations of choices. We present models for frequency data of selections for which it is easy to perform conditional tests by Markov chain Monte Carlo (MCMC) methods. When the restrictions on the combinations can be described in terms of a Segre-Veronese configuration, an explicit form of a Gröbner basis consisting of moves of degree two is readily available for performing a Markov chain. We illustrate our setting with the National Center Test for university entrance examinations in Japan. We also apply our method to testing independence hypotheses involving genotypes at more than one locus or haplotypes of alleles on the same chromosome. Comment: 25 pages, 5 figures.
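
    Degree-two moves can be sketched for the simplest case, a two-way contingency table under the independence model (illustrative code only, not the paper's algebraic machinery; a genuine conditional test would additionally weight the walk toward the hypergeometric distribution):

```python
import random

def walk_fiber(table, steps=200, seed=0):
    """Random walk over two-way contingency tables with fixed row and
    column sums, using degree-two basic moves (+1,-1 / -1,+1) on 2x2
    minors — the simplest Markov basis."""
    rng = random.Random(seed)
    t = [row[:] for row in table]
    nr, nc = len(t), len(t[0])
    for _ in range(steps):
        i, j = rng.sample(range(nr), 2)   # pick two rows...
        k, l = rng.sample(range(nc), 2)   # ...and two columns
        s = rng.choice((1, -1))
        # Apply the move only if all four affected cells stay nonnegative.
        if min(t[i][k] + s, t[j][l] + s, t[i][l] - s, t[j][k] - s) >= 0:
            t[i][k] += s; t[j][l] += s; t[i][l] -= s; t[j][k] -= s
    return t

t = walk_fiber([[3, 1], [2, 4]])  # row sums 4, 6 and column sums 5, 5
```

    Every table visited by the walk has the same margins as the input, which is exactly what a conditional test requires.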

    Recognizing Treelike k-Dissimilarities

    A k-dissimilarity D on a finite set X, |X| >= k, is a map from the set of size-k subsets of X to the real numbers. Such maps naturally arise from edge-weighted trees T with leaf-set X: given a subset Y of X of size k, D(Y) is defined to be the total length of the smallest subtree of T with leaf-set Y. In the case k = 2, it is well known that 2-dissimilarities arising in this way can be characterized by the so-called "4-point condition". However, in the case k > 2, Pachter and Speyer recently posed the following question: given an arbitrary k-dissimilarity, how do we test whether this map comes from a tree? In this paper, we provide an answer to this question, showing that for k >= 3 a k-dissimilarity on a set X arises from a tree if and only if its restriction to every 2k-element subset of X arises from some tree, and that 2k is the least possible subset size to ensure that this is the case. As a corollary, we show that there exists a polynomial-time algorithm to determine when a k-dissimilarity arises from a tree. We also give a 6-point condition for determining when a 3-dissimilarity arises from a tree, similar to the aforementioned 4-point condition. Comment: 18 pages, 4 figures.
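
    The 4-point condition referenced above is straightforward to check directly. The sketch below (invented helper names, dictionary-of-dictionaries distances) tests it on a treelike and a non-treelike 2-dissimilarity:

```python
from itertools import combinations

def four_point(d, points, tol=1e-9):
    """Classical 4-point condition: for every quadruple {x,y,z,w}, the
    two largest of the three sums d(x,y)+d(z,w), d(x,z)+d(y,w),
    d(x,w)+d(y,z) must coincide."""
    for x, y, z, w in combinations(points, 4):
        sums = sorted([d[x][y] + d[z][w],
                       d[x][z] + d[y][w],
                       d[x][w] + d[y][z]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

def dmat(entries):
    """Symmetric distance matrix as a dict of dicts."""
    d = {}
    for (x, y), v in entries.items():
        d.setdefault(x, {})[y] = v
        d.setdefault(y, {})[x] = v
    return d

# Distances realized by the tree ((a,b),(c,d)) with unit branch lengths...
tree_d = dmat({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
               ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3})
# ...and a perturbed map that no tree can realize.
bad_d = dmat({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
              ("a", "d"): 3, ("b", "c"): 1, ("b", "d"): 3})
tree_ok = four_point(tree_d, "abcd")
bad_ok = four_point(bad_d, "abcd")
```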

    Shape-based peak identification for ChIP-Seq

    We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak-calling problem to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets, and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events. The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at http://math.berkeley.edu/~vhower/tpic.html Comment: 12 pages, 6 figures.
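
    The persistence idea behind such methods can be sketched on a toy coverage profile (a minimal illustration of 0-dimensional persistence, not the T-PIC implementation):

```python
def peak_persistence(signal):
    """0-dimensional persistence of the peaks of a 1-D coverage profile:
    sweep positions from high to low, grow components, and when two
    components meet, the shorter peak 'dies' (elder rule).  Persistence
    = peak height minus merge height; highly persistent peaks are
    candidate binding events, low-persistence ones are noise."""
    order = sorted(range(len(signal)), key=lambda i: -signal[i])
    comp, birth, pers = {}, {}, []

    def find(i):                       # union-find with path compression
        while comp[i] != i:
            comp[i] = comp[comp[i]]
            i = comp[i]
        return i

    for i in order:
        comp[i] = i
        birth[i] = signal[i]
        for r in {find(j) for j in (i - 1, i + 1) if j in comp}:
            a, b = sorted((find(i), r), key=lambda k: -birth[k])
            if a != b:
                pers.append(birth[b] - signal[i])  # younger peak dies here
                comp[b] = a
    pers.append(birth[find(order[0])])  # the global maximum never dies
    return sorted(pers, reverse=True)

# Two real peaks (heights 3 and 2) separated by a dip of height 1.
pers = peak_persistence([0, 2, 1, 3, 0])
```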

    Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts

    Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality. We propose a novel method that compares and clusters cells based on their transcript-compatibility read counts rather than on the transcript or gene quantifications used in standard analysis pipelines. In the reanalysis of two landmark yet disparate single-cell RNA-seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.
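
    The idea of clustering on transcript-compatibility counts (TCCs) rather than gene quantifications can be illustrated with a deliberately simple sketch (invented names; the paper's actual pipeline and clustering algorithm differ):

```python
def normalize(tcc):
    """Turn a cell's transcript-compatibility counts into a distribution."""
    total = sum(tcc)
    return [c / total for c in tcc]

def l1(p, q):
    """L1 distance between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cluster_cells(tccs, radius=0.5):
    """Greedy leader clustering of cells by L1 distance between
    normalized TCC vectors — a stand-in for the affinity-based
    clustering used on real data, kept deliberately simple."""
    centers, labels = [], []
    for t in map(normalize, tccs):
        for k, c in enumerate(centers):
            if l1(t, c) < radius:     # close to an existing cluster center
                labels.append(k)
                break
        else:                         # otherwise found a new cell type
            centers.append(t)
            labels.append(len(centers) - 1)
    return labels

# Four cells over three equivalence classes: two distinct cell types.
labels = cluster_cells([[9, 1, 0], [8, 2, 0], [0, 1, 9], [1, 0, 9]])
```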