Pseudoalignment for metagenomic read assignment
Motivation: Read assignment is an important first step in many metagenomic analysis workflows, providing the basis for identification and quantification of species. However, ambiguity among the sequences of many strains makes it difficult to assign reads at the lowest level of taxonomy, and reads are typically assigned to taxonomic levels where they are unambiguous. We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data in order to develop novel methods for rapid and accurate quantification of metagenomic strains.
Results: We find that the recent idea of pseudoalignment introduced in the RNA-Seq context is highly applicable in the metagenomics setting. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state-of-the-art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics projects.
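The coupling of pseudoalignment with EM described above can be illustrated with a minimal sketch. It assumes each read has already been reduced to the set of reference genomes it is compatible with, and it ignores genome lengths and sequencing error. The function name `em_abundances` and its parameters are hypothetical and are not taken from the paper's software.

```python
from collections import Counter

def em_abundances(compat_sets, n_genomes, n_iter=200):
    """Estimate relative genome abundances from per-read compatibility sets.

    compat_sets: iterable of sets of genome indices (one set per read).
    Assumption: genome lengths are ignored, so this is a simplified model.
    """
    # Collapse reads with identical compatibility sets into equivalence classes.
    classes = Counter(frozenset(s) for s in compat_sets if s)
    total = sum(classes.values())
    theta = [1.0 / n_genomes] * n_genomes  # uniform starting point
    for _ in range(n_iter):
        expected = [0.0] * n_genomes
        # E-step: split each class's read count in proportion to current theta.
        for genomes, count in classes.items():
            denom = sum(theta[g] for g in genomes)
            for g in genomes:
                expected[g] += count * theta[g] / denom
        # M-step: renormalize expected counts into abundances.
        theta = [e / total for e in expected]
    return theta

# Six reads unique to genome 0, two unique to genome 1, four ambiguous.
reads = [{0}] * 6 + [{1}] * 2 + [{0, 1}] * 4
theta = em_abundances(reads, n_genomes=2)
```

Working on equivalence classes rather than individual reads keeps each iteration proportional to the number of distinct compatibility sets, which is usually far smaller than the number of reads.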
Markov basis and Groebner basis of Segre-Veronese configuration for testing independence in group-wise selections
We consider testing independence in group-wise selections with some
restrictions on combinations of choices. We present models for frequency data
of selections for which it is easy to perform conditional tests by Markov chain
Monte Carlo (MCMC) methods. When the restrictions on the combinations can be
described in terms of a Segre-Veronese configuration, an explicit form of a
Gröbner basis consisting of moves of degree two is readily available for
performing a Markov chain. We illustrate our setting with the National Center
Test for university entrance examinations in Japan. We also apply our method to
testing independence hypotheses involving genotypes at more than one locus or
haplotypes of alleles on the same chromosome. Comment: 25 pages, 5 figures
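To make the idea of degree-two moves concrete, here is a sketch for the simplest special case only: an unrestricted two-way contingency table under the independence model, where the basic +1/-1 swaps form a Markov basis. It is not the Segre-Veronese construction of the paper, and the function name `markov_basis_step` is hypothetical.

```python
import random

def markov_basis_step(table, rng):
    """One Metropolis step with a degree-two move on a two-way table.

    Row and column sums are preserved. The target is the hypergeometric
    distribution, proportional to 1 / prod(cell!), under independence.
    Assumption: the table has at least two rows and two columns.
    """
    i1, i2 = rng.sample(range(len(table)), 2)
    j1, j2 = rng.sample(range(len(table[0])), 2)
    # Proposed move: +1 at (i1,j1) and (i2,j2), -1 at (i1,j2) and (i2,j1).
    a, b = table[i1][j2], table[i2][j1]
    if a == 0 or b == 0:
        return  # the move would create a negative cell, so stay put
    # Ratio of target probabilities, new table over current table.
    ratio = (a * b) / ((table[i1][j1] + 1) * (table[i2][j2] + 1))
    if rng.random() < min(1.0, ratio):
        table[i1][j1] += 1
        table[i2][j2] += 1
        table[i1][j2] -= 1
        table[i2][j1] -= 1

table = [[3, 1, 2], [0, 4, 1], [2, 2, 5]]
row_sums = [sum(r) for r in table]
col_sums = [sum(c) for c in zip(*table)]
rng = random.Random(0)
for _ in range(2000):
    markov_basis_step(table, rng)
```

Because the two row indices and two column indices are drawn in random order, a move and its negative are proposed with equal probability, so the proposal is symmetric and the plain Metropolis ratio applies.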
Lassoing and corraling rooted phylogenetic trees
The construction of a dendrogram on a set of individuals is a key component of
a genomewide association study. However, even with modern sequencing
technologies the distances on the individuals required for the construction of
such a structure may not always be reliable, making it tempting to exclude them
from an analysis. This, in turn, results in an input set for dendrogram
construction that consists of only partial distance information, which raises
the following fundamental question: for what subset of its leaf set can we
reconstruct uniquely the dendrogram from the distances that it induces on that
subset? By formalizing a dendrogram in terms of an edge-weighted, rooted
phylogenetic tree on a pre-given finite set X with |X|>2 whose edge-weighting
is equidistant and a set of partial distances on X in terms of a set L of
2-subsets of X, we investigate this problem in terms of when such a tree is
lassoed, that is, uniquely determined by the elements in L. For this we
consider four different formalizations of the idea of "uniquely determining"
giving rise to four distinct types of lassos. We present characterizations for
all of them in terms of the child-edge graphs of the interior vertices of such
a tree. Our characterizations imply in particular that in case the tree in
question is binary then all four types of lasso must coincide.
Recognizing Treelike k-Dissimilarities
A k-dissimilarity D on a finite set X, |X| >= k, is a map from the set of
size k subsets of X to the real numbers. Such maps naturally arise from
edge-weighted trees T with leaf-set X: Given a subset Y of X of size k, D(Y) is
defined to be the total length of the smallest subtree of T with leaf-set Y.
In case k = 2, it is well-known that 2-dissimilarities arising in this way can
be characterized by the so-called "4-point condition". However, in case k > 2
Pachter and Speyer recently posed the following question: Given an arbitrary
k-dissimilarity, how do we test whether this map comes from a tree? In this
paper, we provide an answer to this question, showing that for k >= 3 a
k-dissimilarity on a set X arises from a tree if and only if its restriction to
every 2k-element subset of X arises from some tree, and that 2k is the least
possible subset size to ensure that this is the case. As a corollary, we show
that there exists a polynomial-time algorithm to determine when a
k-dissimilarity arises from a tree. We also give a 6-point condition for
determining when a 3-dissimilarity arises from a tree, that is similar to the
aforementioned 4-point condition. Comment: 18 pages, 4 figures
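For the classical case k = 2 mentioned above, the 4-point condition says that among the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k), the two largest are equal. The sketch below checks this over all quadruples of distinct elements. It assumes the input is already a metric given as a symmetric matrix with zero diagonal, and the function name `satisfies_four_point` is hypothetical. It does not implement the paper's test for k >= 3.

```python
from itertools import combinations

def satisfies_four_point(d, tol=1e-9):
    """Check the 4-point condition on all quadruples of distinct elements.

    d: symmetric distance matrix (list of lists) with zero diagonal,
    assumed to be a metric already.
    """
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # The two largest of the three sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances from a quartet tree 01|23 with unit edge lengths.
tree_like = [[0, 2, 3, 3],
             [2, 0, 3, 3],
             [3, 3, 0, 2],
             [3, 3, 2, 0]]
# A metric that is not induced by any tree.
not_tree_like = [[0, 1, 2, 3],
                 [1, 0, 3, 2],
                 [2, 3, 0, 1],
                 [3, 2, 1, 0]]
```

The check enumerates all 4-element subsets, so it runs in time proportional to n^4 for n elements.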
Likelihood Geometry
We study the critical points of monomial functions over an algebraic subset
of the probability simplex. The number of critical points on the Zariski
closure is a topological invariant of that embedded projective variety, known
as its maximum likelihood degree. We present an introduction to this theory and
its statistical motivations. Many favorite objects from combinatorial algebraic
geometry are featured: toric varieties, A-discriminants, hyperplane
arrangements, Grassmannians, and determinantal varieties. Several new results
are included, especially on the likelihood correspondence and its bidegree.
These notes were written for the second author's lectures at the CIME-CIRM
summer course on Combinatorial Algebraic Geometry at Levico Terme in June 2013. Comment: 45 pages; minor changes and additions
Determinants of response to a parent questionnaire about development and behaviour in 3 year olds: European multicentre study of congenital toxoplasmosis.
Background:
We aimed to determine how response to a parent-completed postal questionnaire measuring development, behaviour, impairment, and parental concerns and anxiety, varies in different European centres.
Methods:
Prospective cohort study of 3 year old children, with and without congenital toxoplasmosis, who were identified by prenatal or neonatal screening for toxoplasmosis in 11 centres in 7 countries. Parents were mailed a questionnaire that comprised all or part of existing validated tools. We determined the effect of characteristics of the centre and child on response, age at questionnaire completion, and response to child drawing tasks.
Results:
The questionnaire took 21 minutes to complete on average. 67% (714/1058) of parents responded. Few parents (60/1058) refused to participate. The strongest determinants of response were the score for organisational attributes of the study centre (such as direct involvement in follow up and access to an address register), and infection with congenital toxoplasmosis. Age at completion was associated with study centre, presence of neurological abnormalities in early infancy, and duration of prenatal treatment. Completion rates for individual questions exceeded 92% except for child completed drawings of a man (70%), which were completed more by girls, older children, and in certain centres.
Conclusion:
Differences in response across European centres were predominantly related to the organisation of follow up and access to correct addresses. The questionnaire was acceptable in all six countries and offers a low cost tool for assessing development, behaviour, and parental concerns and anxiety in multinational studies.
Optimality regions and fluctuations for Bernoulli last passage models
We study the sequence alignment problem and its independent version, the discrete Hammersley process with an exploration penalty.
We obtain rigorous upper bounds for the number of optimality regions in both models near the soft edge. At zero penalty the independent model becomes an exactly solvable model and we identify cases for which the law of the last passage time converges to a Tracy-Widom law.
Viral population estimation using pyrosequencing
The diversity of virus populations within single infected hosts presents a
major difficulty for the natural immune response as well as for vaccine design
and antiviral drug therapy. Recently developed pyrophosphate-based sequencing
technologies (pyrosequencing) can be used for quantifying this diversity by
ultra-deep sequencing of virus samples. We present computational methods for
the analysis of such sequence data and apply these techniques to pyrosequencing
data obtained from HIV populations within patients harboring drug resistant
virus strains. Our main result is the estimation of the population structure of
the sample from the pyrosequencing reads. This inference is based on a
statistical approach to error correction, followed by a combinatorial algorithm
for constructing a minimal set of haplotypes that explain the data. Using this
set of explaining haplotypes, we apply a statistical model to infer the
frequencies of the haplotypes in the population via an EM algorithm. We
demonstrate that pyrosequencing reads allow for effective population
reconstruction by extensive simulations and by comparison to 165 sequences
obtained directly from clonal sequencing of four independent, diverse HIV
populations. Thus, pyrosequencing can be used for cost-effective estimation of
the structure of virus populations, promising new insights into viral
evolutionary dynamics and disease control strategies. Comment: 23 pages, 13 figures
Optimal Path Planning for Unmanned Combat Aerial Vehicles to Defeat Radar Tracking
Peer Reviewed
http://deepblue.lib.umich.edu/bitstream/2027.42/76140/1/AIAA-14303-218.pd