Models for transcript quantification from RNA-Seq
RNA-Seq is rapidly becoming the standard technology for transcriptome
analysis. Fundamental to many of the applications of RNA-Seq is the
quantification problem, which is the accurate measurement of relative
transcript abundances from the sequenced reads. We focus on this problem, and
review many recently published models that are used to estimate the relative
abundances. In addition to describing the models and the different approaches
to inference, we also explain how methods are related to each other. A key
result is that we show how inference with many of the models results in
identical estimates of relative abundances, even though model formulations can
be very different. In fact, we are able to show how a single general model
captures many of the elements of previously published methods. We also review
the applications of RNA-Seq models to differential analysis, and explain why
accurate relative transcript abundance estimates are crucial for downstream
analyses.
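As an illustration of the quantification problem described above, here is a minimal sketch of an expectation-maximization (EM) estimate of relative transcript abundances. It assumes one common formulation (each read is compatible with a known set of transcripts, and each transcript has a known effective length) and is not the general model of the review itself. The names `em_abundances`, `compat`, and `eff_len` are hypothetical and chosen only for this example.

```python
def em_abundances(compat, eff_len, n_iter=200):
    """Estimate relative transcript abundances with a simple EM loop.

    compat  : list of lists; compat[r] holds the transcript indices
              that read r is compatible with (assumed input format).
    eff_len : list of effective transcript lengths (assumed known).
    Returns a list of relative abundances that sums to 1.
    """
    n_t = len(eff_len)
    n_reads = len(compat)
    # alpha[t]: probability that a read originates from transcript t.
    alpha = [1.0 / n_t] * n_t
    for _ in range(n_iter):
        counts = [0.0] * n_t
        # E-step: split each read among its compatible transcripts,
        # proportionally to alpha[t] / eff_len[t].
        for transcripts in compat:
            weights = [alpha[t] / eff_len[t] for t in transcripts]
            total = sum(weights)
            for t, w in zip(transcripts, weights):
                counts[t] += w / total
        # M-step: re-estimate the read-origin probabilities.
        alpha = [c / n_reads for c in counts]
    # Convert read fractions into length-normalized relative abundances.
    rho = [a / length for a, length in zip(alpha, eff_len)]
    z = sum(rho)
    return [r / z for r in rho]


# Toy data: two transcripts of equal effective length, six reads unique
# to transcript 0, two unique to transcript 1, and two ambiguous reads.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 2
rho = em_abundances(reads, eff_len=[100.0, 100.0])
```

In this toy case the ambiguous reads are divided in proportion to the current estimates, so the result settles near the 3:1 ratio implied by the uniquely mapped reads.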
Selecting universities: personal preference and rankings
Polyhedral geometry can be used to quantitatively assess the dependence of
rankings on personal preference, and provides a tool for both students and
universities to assess US News and World Report rankings.
The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.
Comment: 41 pages, 4 figures
Combinatorics of least squares trees
A recurring theme in the least squares approach to phylogenetics has been the
discovery of elegant combinatorial formulas for the least squares estimates of
edge lengths. These formulas have proved useful for the development of
efficient algorithms, and have also been important for understanding
connections among popular phylogeny algorithms. For example, the selection
criterion of the neighbor-joining algorithm is now understood in terms of the
combinatorial formulas of Pauplin for estimating tree length.
We highlight a phylogenetically desirable property that weighted least
squares methods should satisfy, and provide a complete characterization of
methods that satisfy the property. The necessary and sufficient condition is a
multiplicative four point condition that the variance matrix needs to
satisfy. The proof is based on the observation that the Lagrange multipliers in
the proof of the Gauss-Markov theorem are tree-additive. Our results
generalize and complete previous work on ordinary least squares, balanced
minimum evolution and the taxon weighted variance model. They also provide a
time-optimal algorithm for computation.
MAVID: Constrained ancestral alignment of multiple sequences
We describe a new global multiple alignment program capable of aligning a
large number of genomic regions. Our progressive alignment approach
incorporates the following ideas: maximum-likelihood inference of ancestral
sequences, automatic guide-tree construction, protein based anchoring of
ab-initio gene predictions, and constraints derived from a global homology map
of the sequences. We have implemented these ideas in the MAVID program, which
is able to accurately align multiple genomic regions up to megabases long.
MAVID is able to effectively align divergent sequences, as well as incomplete
unfinished sequences. We demonstrate the capabilities of the program on the
benchmark CFTR region which consists of 1.8Mb of human sequence and 20
orthologous regions in marsupials, birds, fish, and mammals. Finally, we
describe two large MAVID alignments: an alignment of all the available HIV
genomes and a multiple alignment of the entire human, mouse and rat genomes.
Reconstructing Trees from Subtree Weights
The tree-metric theorem provides a necessary and sufficient condition for a
dissimilarity matrix to be a tree metric, and has served as the foundation for
numerous distance-based reconstruction methods in phylogenetics. Our main
result is an extension of the tree-metric theorem to more general dissimilarity
maps. In particular, we show that a tree with $n$ leaves is reconstructible from
the weights of the $m$-leaf subtrees provided that $n \geq 2m-1$.
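The tree-metric theorem mentioned above is usually stated through the four-point condition: for every four leaves, the two largest of the three pairwise sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) must be equal. The following minimal sketch checks that condition on quadruples of distinct leaves only. It assumes the input is already a symmetric matrix with zero diagonal, and it illustrates the classical theorem, not the extension to subtree weights. The function name `satisfies_four_point` and the tolerance `tol` are hypothetical.

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """Check the four-point condition on all quadruples of distinct leaves.

    d is assumed to be a symmetric dissimilarity matrix (list of lists)
    with zero diagonal.
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # The two largest sums must coincide (up to the tolerance).
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances on a quartet tree ((0,1),(2,3)) with all edge lengths 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

# Same matrix with d(0,2) perturbed, which breaks the condition.
perturbed = [row[:] for row in tree_like]
perturbed[0][2] = perturbed[2][0] = 4

result_tree = satisfies_four_point(tree_like)
result_perturbed = satisfies_four_point(perturbed)
```

The check enumerates all quadruples, so it takes time on the order of n^4. That is adequate for illustration but is not meant as an efficient reconstruction method.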
Towards the Human Genotope
The human genotope is the convex hull of all allele frequency vectors that
can be obtained from the genotypes present in the human population. In this
paper we take a few initial steps towards a description of this object, which
may be fundamental for future population based genetics studies. Here we use
data from the HapMap Project, restricted to two ENCODE regions, to study a
subpolytope of the human genotope. We study three different approaches for
obtaining informative low-dimensional projections of this subpolytope. The
projections are specified by projection onto a few tag SNPs, principal component
analysis, and archetypal analysis. We describe the application of our geometric
approach to identifying structure in populations based on single nucleotide
polymorphisms.