The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.
Comment: 41 pages, 4 figures
The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007
In special coordinates (codon position--specific nucleotide frequencies)
bacterial genomes form two straight lines in 9-dimensional space: one line for
eubacterial genomes, another for archaeal genomes. All 348 distinct
bacterial genomes available in GenBank in April 2007 belong to these lines
with high accuracy. The main challenge now is to explain the observed high
accuracy. The new phenomenon of complementary symmetry for codon
position--specific nucleotide frequencies is observed. The results of analysis
of several codon usage models are presented. We demonstrate that the
mean--field approximation, which is also known as context--free, or complete
independence model, or Segre variety, can serve as a reasonable approximation
to the real codon usage. The first two principal components of codon usage
correlate strongly with genomic G+C content and the optimal growth temperature
respectively. The variation of codon usage along the third component is related
to the curvature of the mean-field approximation. The first three eigenvalues in
codon usage PCA explain 59.1%, 7.8% and 4.7% of the variation. Codon usage in
eubacterial and archaeal genomes is clearly distributed along two third-order
curves with genomic G+C content as a parameter.
Comment: Significantly extended version with new data for all the 348 distinct
bacterial genomes available in Genbank in April 2007
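The coordinates and the mean-field model mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the toy sequence are hypothetical, and the input is assumed to be a single in-frame coding sequence containing only A, C, G and T. Each of the three codon positions gets four nucleotide frequencies that sum to 1, which leaves 12 - 3 = 9 free coordinates; the mean-field (complete independence) model then approximates each codon frequency by the product of its three position-specific frequencies.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"

def position_frequencies(coding_sequence):
    """Return freq[k][n]: frequency of nucleotide n at codon position k (k = 0, 1, 2).

    Assumes an in-frame sequence over the alphabet ACGT (illustrative only).
    """
    usable = len(coding_sequence) - len(coding_sequence) % 3  # drop a trailing partial codon
    counts = [Counter() for _ in range(3)]
    for i in range(usable):
        counts[i % 3][coding_sequence[i]] += 1
    n_codons = usable // 3
    return [{n: counts[k][n] / n_codons for n in NUCLEOTIDES} for k in range(3)]

def mean_field_codon_usage(freq):
    """Approximate every codon frequency by the product of its position-specific frequencies."""
    return {
        a + b + c: freq[0][a] * freq[1][b] * freq[2][c]
        for a, b, c in product(NUCLEOTIDES, repeat=3)
    }

# Hypothetical toy sequence with codons ATG GCC AAG TAA.
freq = position_frequencies("ATGGCCAAGTAA")
approx = mean_field_codon_usage(freq)
print(freq[0]["A"], approx["ATG"])
```

Comparing `approx` with the observed codon counts of a real genome would show how far its codon usage departs from the independence model.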
On the frontiers of polynomial computations in tropical geometry
We study some basic algorithmic problems concerning the intersection of
tropical hypersurfaces in general dimension: deciding whether this intersection
is nonempty, whether it is a tropical variety, and whether it is connected, as
well as counting the number of connected components. We characterize the
borderline between tractable and hard computations by proving
NP-hardness and #P-hardness results under various
strong restrictions of the input data, as well as providing polynomial time
algorithms for various other restrictions.
Comment: 17 pages, 5 figures. To appear in Journal of Symbolic Computation
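As a small illustration of the objects in this abstract, the sketch below tests whether a given point lies on a tropical hypersurface, and hence whether it lies in an intersection of several. It is not taken from the paper. It assumes the min-plus convention, in which a tropical polynomial is the minimum of finitely many affine terms and its hypersurface is the set of points where that minimum is attained at least twice. The representation of terms and all function names are hypothetical. Checking a single point is easy; the hardness results quoted above concern deciding properties of the whole intersection.

```python
def term_values(terms, point):
    """Evaluate each tropical monomial c + <exponent, point> of a tropical polynomial."""
    return [c + sum(e * x for e, x in zip(exponent, point)) for c, exponent in terms]

def on_tropical_hypersurface(terms, point, tol=1e-9):
    """True if the minimum over all terms is attained at least twice (min-plus convention)."""
    values = term_values(terms, point)
    smallest = min(values)
    return sum(1 for v in values if v - smallest <= tol) >= 2

def in_intersection(polynomials, point):
    """True if the point lies on every hypersurface in the list."""
    return all(on_tropical_hypersurface(terms, point) for terms in polynomials)

# Tropical line min(x, y, 0), written as (coefficient, exponent vector) pairs.
tropical_line = [(0, (1, 0)), (0, (0, 1)), (0, (0, 0))]
print(on_tropical_hypersurface(tropical_line, (-1, -1)))  # x and y tie for the minimum
print(on_tropical_hypersurface(tropical_line, (1, 1)))    # only the constant term is minimal
```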
Bounds on the number of inference functions of a graphical model
Directed and undirected graphical models, also called Bayesian networks and
Markov random fields, respectively, are important statistical tools in a wide
variety of fields, ranging from computational biology to probabilistic
artificial intelligence. We give an upper bound on the number of inference
functions of any graphical model. This bound is polynomial in the size of the
model, for a fixed number of parameters, thus improving the exponential upper
bound given by Pachter and Sturmfels. We also show that our bound is tight up
to a constant factor, by constructing a family of hidden Markov models whose
number of inference functions agrees asymptotically with the upper bound.
Finally, we apply this bound to a model for sequence alignment that is used in
computational biology.
Comment: 19 pages, 7 figures
Computing medians and means in Hadamard spaces
The geometric median as well as the Fréchet mean of points in an Hadamard
space are important in both theory and applications. Surprisingly, no
algorithms for their computation are hitherto known. To address this issue, we
use a split version of the proximal point algorithm for minimizing a sum of
convex functions and prove that this algorithm produces a sequence converging
to a minimizer of the objective function, which extends a recent result of D.
Bertsekas (2001) into Hadamard spaces. The method is quite robust and not only
does it yield algorithms for the median and the mean, but it also applies to
various other optimization problems. We moreover show that another algorithm
for computing the Fréchet mean can be derived from the law of large numbers due
to K.-T. Sturm (2002). In applications, computing medians and means is probably
most needed in tree space, which is an instance of an Hadamard space, invented
by Billera, Holmes, and Vogtmann (2001) as a tool for averaging phylogenetic
trees. It turns out, however, that it can also be used to model numerous other
tree-like structures. Since there now exists a polynomial-time algorithm for
computing geodesics in tree space due to M. Owen and S. Provan (2011), we
obtain efficient algorithms for computing medians and means, which can be
directly used in practice.
Comment: Corrected version. Accepted in SIAM Journal on Optimization
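The two ideas in this abstract can be sketched in Euclidean space, which is the simplest Hadamard space. The sketch is a hedged illustration, not the paper's implementation. It assumes that geodesics are straight segments and that the proximal step for a distance term moves the current point toward the data point by at most the step size. The step sizes 1/k are chosen here so that they sum to infinity while their squares are summable. All function names are hypothetical. In tree space, `geodesic` would have to be replaced by a geodesic routine such as the Owen and Provan algorithm, which is not implemented here.

```python
import math

def geodesic(x, y, t):
    """Point at parameter t in [0, 1] on the geodesic from x to y (a segment in Euclidean space)."""
    return tuple((1 - t) * xi + t * yi for xi, yi in zip(x, y))

def median_cyclic_proximal(points, cycles=2000):
    """Approximate the geometric median by applying one proximal step per data point, cyclically."""
    x = points[0]
    for k in range(1, cycles + 1):
        step = 1.0 / k  # assumed step-size rule
        for a in points:
            d = math.dist(x, a)
            if d > 0:
                # Proximal map of step * d(., a): move toward a by distance min(step, d).
                x = geodesic(x, a, min(1.0, step / d))
    return x

def inductive_mean(samples):
    """Inductive mean: move from the current estimate toward the k-th sample by a fraction 1/k."""
    s = samples[0]
    for k, y in enumerate(samples[1:], start=2):
        s = geodesic(s, y, 1.0 / k)
    return s

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
print(median_cyclic_proximal(points))  # close to (1, 0), the median of collinear points
print(inductive_mean([(0.0, 0.0), (2.0, 0.0), (4.0, 6.0)]))  # close to the arithmetic mean (2, 2)
```

In Euclidean space the inductive mean of a fixed list coincides with its arithmetic mean. In a general Hadamard space the law-of-large-numbers approach instead feeds it points sampled at random from the data, and the estimate then depends on the sampling order.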