296 research outputs found

    Pseudoalignment for metagenomic read assignment

    Motivation: Read assignment is an important first step in many metagenomic analysis workflows, providing the basis for identification and quantification of species. However, ambiguity among the sequences of many strains makes it difficult to assign reads at the lowest level of taxonomy, and reads are typically assigned to taxonomic levels where they are unambiguous. We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data in order to develop novel methods for rapid and accurate quantification of metagenomic strains. Results: We find that the recent idea of pseudoalignment introduced in the RNA-Seq context is highly applicable in the metagenomics setting. When pseudoalignment is coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state-of-the-art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics projects.
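
    The EM step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that pseudoalignment has already reduced each read to the set of genome indices it is compatible with, and it ignores genome length and sequencing error. The function name `em_abundances` and its parameters are hypothetical and are not taken from the paper's software.

```python
from collections import Counter


def em_abundances(read_classes, n_genomes, n_iter=200, tol=1e-10):
    """Estimate relative genome abundances from ambiguous read assignments.

    read_classes: one tuple per read, listing the indices of the genomes
    the read is compatible with (assumed to come from pseudoalignment).
    """
    # Collapse identical compatibility sets so each class is handled once.
    class_counts = Counter(tuple(sorted(set(c))) for c in read_classes if c)
    total = sum(class_counts.values())
    theta = [1.0 / n_genomes] * n_genomes  # uniform starting point

    for _ in range(n_iter):
        expected = [0.0] * n_genomes
        # E-step: share each class's reads in proportion to current abundances.
        for genomes, count in class_counts.items():
            denom = sum(theta[g] for g in genomes)
            for g in genomes:
                expected[g] += count * theta[g] / denom
        # M-step: renormalise the expected read counts into abundances.
        new_theta = [e / total for e in expected]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta
```

    For example, with six reads unique to genome 0, two unique to genome 1, and four compatible with both, the estimates settle near 0.75 and 0.25: the ambiguous reads are shared in the ratio suggested by the unique ones.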

    Markov basis and Groebner basis of Segre-Veronese configuration for testing independence in group-wise selections

    We consider testing independence in group-wise selections with some restrictions on combinations of choices. We present models for frequency data of selections for which it is easy to perform conditional tests by Markov chain Monte Carlo (MCMC) methods. When the restrictions on the combinations can be described in terms of a Segre-Veronese configuration, an explicit form of a Gröbner basis consisting of moves of degree two is readily available for running a Markov chain. We illustrate our setting with the National Center Test for university entrance examinations in Japan. We also apply our method to testing independence hypotheses involving genotypes at more than one locus or haplotypes of alleles on the same chromosome. Comment: 25 pages, 5 figures.
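
    To show what a Markov chain built from degree-two moves looks like, here is a sketch for the simplest related case: an ordinary two-way contingency table with fixed row and column sums. This is not the Segre-Veronese configuration studied in the paper. The function name `markov_basis_walk` is hypothetical, and the target distribution is assumed to be the hypergeometric distribution that arises under independence given the margins.

```python
import random


def markov_basis_walk(table, n_steps, seed=0):
    """Metropolis walk over two-way tables with fixed row and column sums.

    Each proposal is a degree-two move: +1 on cells (i, j) and (k, l),
    -1 on cells (i, l) and (k, j). Needs at least two rows and two columns.
    """
    rng = random.Random(seed)
    t = [row[:] for row in table]  # work on a copy of the input
    n_rows, n_cols = len(t), len(t[0])
    for _ in range(n_steps):
        i, k = rng.sample(range(n_rows), 2)
        j, l = rng.sample(range(n_cols), 2)
        # Reject moves that would make a cell negative.
        if t[i][l] == 0 or t[k][j] == 0:
            continue
        # Ratio of hypergeometric weights (proportional to 1 / prod n_ij!).
        ratio = (t[i][l] * t[k][j]) / ((t[i][j] + 1) * (t[k][l] + 1))
        if rng.random() < min(1.0, ratio):
            t[i][j] += 1
            t[k][l] += 1
            t[i][l] -= 1
            t[k][j] -= 1
    return t
```

    A conditional test would evaluate a statistic such as Pearson's chi-square on the tables visited by this walk and compare those values with the statistic of the observed table.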

    Lassoing and corraling rooted phylogenetic trees

    The construction of a dendrogram on a set of individuals is a key component of a genomewide association study. However, even with modern sequencing technologies, the distances on the individuals required for the construction of such a structure may not always be reliable, making it tempting to exclude them from an analysis. This, in turn, results in an input set for dendrogram construction that consists of only partial distance information, which raises the following fundamental question: for what subset of its leaf set can we uniquely reconstruct the dendrogram from the distances that it induces on that subset? By formalizing a dendrogram in terms of an edge-weighted, rooted phylogenetic tree on a pre-given finite set X with |X|>2 whose edge-weighting is equidistant, and a set of partial distances on X in terms of a set L of 2-subsets of X, we investigate this problem in terms of when such a tree is lassoed, that is, uniquely determined by the elements in L. For this we consider four different formalizations of the idea of "uniquely determining", giving rise to four distinct types of lassos. We present characterizations for all of them in terms of the child-edge graphs of the interior vertices of such a tree. Our characterizations imply in particular that if the tree in question is binary, then all four types of lasso must coincide.

    Recognizing Treelike k-Dissimilarities

    A k-dissimilarity D on a finite set X, |X| >= k, is a map from the set of size k subsets of X to the real numbers. Such maps naturally arise from edge-weighted trees T with leaf-set X: Given a subset Y of X of size k, D(Y) is defined to be the total length of the smallest subtree of T with leaf-set Y. In case k = 2, it is well-known that 2-dissimilarities arising in this way can be characterized by the so-called "4-point condition". However, in case k > 2 Pachter and Speyer recently posed the following question: Given an arbitrary k-dissimilarity, how do we test whether this map comes from a tree? In this paper, we provide an answer to this question, showing that for k >= 3 a k-dissimilarity on a set X arises from a tree if and only if its restriction to every 2k-element subset of X arises from some tree, and that 2k is the least possible subset size to ensure that this is the case. As a corollary, we show that there exists a polynomial-time algorithm to determine when a k-dissimilarity arises from a tree. We also give a 6-point condition for determining when a 3-dissimilarity arises from a tree that is similar to the aforementioned 4-point condition. Comment: 18 pages, 4 figures.
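
    The classical 4-point condition for the case k = 2 can be checked directly. The sketch below is an illustration of that well-known condition only, not of the paper's 2k-subset result or its 6-point condition. It assumes the dissimilarity is stored in a dictionary keyed by `frozenset` pairs, and the function name `satisfies_four_point` is hypothetical.

```python
from itertools import combinations_with_replacement


def satisfies_four_point(d, points, tol=1e-9):
    """Check the 4-point condition for a 2-dissimilarity.

    d maps frozenset({a, b}) to the distance between distinct points a, b.
    For every choice of four points (repeats allowed), the two largest of
    the three pairwise sums must be equal. Allowing repeats also enforces
    non-negativity and the triangle inequality.
    """
    def dist(a, b):
        return 0 if a == b else d[frozenset((a, b))]

    for i, j, k, l in combinations_with_replacement(points, 4):
        sums = sorted([
            dist(i, j) + dist(k, l),
            dist(i, k) + dist(j, l),
            dist(i, l) + dist(j, k),
        ])
        if sums[2] - sums[1] > tol:
            return False
    return True
```

    As a usage example, take a tree in which leaves a and b hang off one interior vertex with edge lengths 1 and 2, leaves c and d hang off another with lengths 3 and 4, and the interior edge has length 5. The induced distances (ab = 3, cd = 7, ac = 9, ad = 10, bc = 10, bd = 11) pass the check, while changing ad to 12 makes it fail.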

    Likelihood Geometry

    We study the critical points of monomial functions over an algebraic subset of the probability simplex. The number of critical points on the Zariski closure is a topological invariant of that embedded projective variety, known as its maximum likelihood degree. We present an introduction to this theory and its statistical motivations. Many favorite objects from combinatorial algebraic geometry are featured: toric varieties, A-discriminants, hyperplane arrangements, Grassmannians, and determinantal varieties. Several new results are included, especially on the likelihood correspondence and its bidegree. These notes were written for the second author's lectures at the CIME-CIRM summer course on Combinatorial Algebraic Geometry at Levico Terme in June 2013. Comment: 45 pages; minor changes and additions.

    Determinants of response to a parent questionnaire about development and behaviour in 3 year olds: European multicentre study of congenital toxoplasmosis.

    Background: We aimed to determine how response to a parent-completed postal questionnaire measuring development, behaviour, impairment, and parental concerns and anxiety varies in different European centres. Methods: Prospective cohort study of 3-year-old children, with and without congenital toxoplasmosis, who were identified by prenatal or neonatal screening for toxoplasmosis in 11 centres in 7 countries. Parents were mailed a questionnaire that comprised all or part of existing validated tools. We determined the effect of characteristics of the centre and child on response, age at questionnaire completion, and response to child drawing tasks. Results: The questionnaire took 21 minutes to complete on average. 67% (714/1058) of parents responded. Few parents (60/1058) refused to participate. The strongest determinants of response were the score for organisational attributes of the study centre (such as direct involvement in follow-up and access to an address register), and infection with congenital toxoplasmosis. Age at completion was associated with study centre, presence of neurological abnormalities in early infancy, and duration of prenatal treatment. Completion rates for individual questions exceeded 92% except for child-completed drawings of a man (70%), which were completed more often by girls, by older children, and in certain centres. Conclusion: Differences in response across European centres were predominantly related to the organisation of follow-up and access to correct addresses. The questionnaire was acceptable in all six countries and offers a low-cost tool for assessing development, behaviour, and parental concerns and anxiety in multinational studies.

    Optimality regions and fluctuations for Bernoulli last passage models

    We study the sequence alignment problem and its independent version, the discrete Hammersley process with an exploration penalty. We obtain rigorous upper bounds for the number of optimality regions in both models near the soft edge. At zero penalty the independent model becomes an exactly solvable model and we identify cases for which the law of the last passage time converges to a Tracy-Widom law.

    Viral population estimation using pyrosequencing

    The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate-based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug-resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an EM algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies. Comment: 23 pages, 13 figures.