
    The Mathematics of Phylogenomics

    The grand challenges in biology today are being shaped by powerful high-throughput technologies that have revealed the genomes of many organisms, global expression patterns of genes and detailed information about variation within populations. We are therefore able to ask, for the first time, fundamental questions about the evolution of genomes, the structure of genes and their regulation, and the connections between genotypes and phenotypes of individuals. The answers to these questions are all predicated on progress in a variety of computational, statistical, and mathematical fields. The rapid growth in the characterization of genomes has led to the advancement of a new discipline called Phylogenomics. This discipline results from the combination of two major fields in the life sciences: Genomics, i.e., the study of the function and structure of genes and genomes; and Molecular Phylogenetics, i.e., the study of the hierarchical evolutionary relationships among organisms and their genomes. The objective of this article is to offer mathematicians a first introduction to this emerging field, and to discuss specific mathematical problems and developments arising from phylogenomics. Comment: 41 pages, 4 figures

    The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007

    In special coordinates (codon position-specific nucleotide frequencies) bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes, another for archaeal genomes. All 348 distinct bacterial genomes available in GenBank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain the observed high accuracy. A new phenomenon of complementary symmetry for codon position-specific nucleotide frequencies is observed. The results of an analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, which is also known as the context-free model, the complete independence model, or the Segre variety, can serve as a reasonable approximation to the real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and the optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The first three eigenvalues in the codon usage PCA explain 59.1%, 7.8% and 4.7% of the variation. The codon usage of eubacterial and archaeal genomes is clearly distributed along two third-order curves with genomic G+C content as a parameter. Comment: Significantly extended version with new data for all the 348 distinct bacterial genomes available in GenBank in April 2007
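
As a rough illustration of the coordinates this abstract refers to, the sketch below computes codon position-specific nucleotide frequencies for one coding sequence and the mean-field (complete independence) estimate of codon usage built from them. The function names are hypothetical, the input is assumed to be an in-frame coding sequence containing only A, C, G and T, and this is not the authors' pipeline. Twelve frequencies are produced; since each position's frequencies sum to 1, nine of them are independent, which is presumably the source of the 9-dimensional space.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def position_specific_frequencies(coding_seq):
    """Return freqs[k][n]: frequency of nucleotide n at codon position k (0, 1, 2).

    Assumes an in-frame sequence over A, C, G, T; trailing bases that do not
    complete a codon are ignored.
    """
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence must contain at least one full codon")
    counts = [Counter() for _ in range(3)]
    for i in range(usable):
        counts[i % 3][seq[i]] += 1
    return [{n: counts[k][n] / n_codons for n in NUCLEOTIDES} for k in range(3)]


def mean_field_codon_usage(freqs):
    """Mean-field estimate: codon frequency = product of its position frequencies."""
    return {
        a + b + c: freqs[0][a] * freqs[1][b] * freqs[2][c]
        for a, b, c in product(NUCLEOTIDES, repeat=3)
    }


freqs = position_specific_frequencies("ATGGCCATG")  # codons ATG, GCC, ATG
usage = mean_field_codon_usage(freqs)
```

Comparing `usage` with the observed codon counts of a genome is one way to see how far real codon usage departs from the independence model.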

    On the frontiers of polynomial computations in tropical geometry

    We study some basic algorithmic problems concerning the intersection of tropical hypersurfaces in general dimension: deciding whether this intersection is nonempty, whether it is a tropical variety, and whether it is connected, as well as counting the number of connected components. We characterize the borderline between tractable and hard computations by proving NP-hardness and #P-hardness results under various strong restrictions of the input data, as well as providing polynomial time algorithms for various other restrictions. Comment: 17 pages, 5 figures. To appear in Journal of Symbolic Computation
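
To make the objects in this abstract concrete, here is a minimal sketch of the easy direction only: checking whether a given point lies on a tropical hypersurface, and hence in an intersection of several. It assumes the min-plus convention (the paper may use max-plus), represents a tropical polynomial as a hypothetical list of `(coefficient, exponent_vector)` pairs, and does not attempt the decision problems whose hardness the paper studies.

```python
def tropical_terms(poly, x):
    """Values c + <a, x> of each tropical monomial at the point x."""
    return [c + sum(e * xi for e, xi in zip(exps, x)) for c, exps in poly]


def on_tropical_hypersurface(poly, x, tol=1e-9):
    """True if the minimum over the monomials is attained at least twice at x."""
    values = tropical_terms(poly, x)
    lowest = min(values)
    return sum(1 for v in values if v - lowest <= tol) >= 2


def in_intersection(polys, x, tol=1e-9):
    """True if x lies on every hypersurface in polys."""
    return all(on_tropical_hypersurface(p, x, tol) for p in polys)


# Tropical line min(x, y, 0) in the plane.
line = [(0, (1, 0)), (0, (0, 1)), (0, (0, 0))]
print(on_tropical_hypersurface(line, (0, 5)))  # x and the constant term tie
```

Verifying a candidate point is cheap; deciding whether any such point exists is the kind of question the abstract classifies as tractable or hard depending on the input restrictions.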

    Bounds on the number of inference functions of a graphical model

    Directed and undirected graphical models, also called Bayesian networks and Markov random fields, respectively, are important statistical tools in a wide variety of fields, ranging from computational biology to probabilistic artificial intelligence. We give an upper bound on the number of inference functions of any graphical model. This bound is polynomial in the size of the model, for a fixed number of parameters, thus improving the exponential upper bound given by Pachter and Sturmfels. We also show that our bound is tight up to a constant factor, by constructing a family of hidden Markov models whose number of inference functions agrees asymptotically with the upper bound. Finally, we apply this bound to a model for sequence alignment that is used in computational biology. Comment: 19 pages, 7 figures

    Computing medians and means in Hadamard spaces

    The geometric median as well as the Fréchet mean of points in a Hadamard space are important in both theory and applications. Surprisingly, no algorithms for their computation are hitherto known. To address this issue, we use a split version of the proximal point algorithm for minimizing a sum of convex functions and prove that this algorithm produces a sequence converging to a minimizer of the objective function, which extends a recent result of D. Bertsekas (2001) into Hadamard spaces. The method is quite robust: not only does it yield algorithms for the median and the mean, but it also applies to various other optimization problems. We moreover show that another algorithm for computing the Fréchet mean can be derived from the law of large numbers due to K.-T. Sturm (2002). In applications, computing medians and means is probably most needed in tree space, which is an instance of a Hadamard space, invented by Billera, Holmes, and Vogtmann (2001) as a tool for averaging phylogenetic trees. It turns out, however, that it can also be used to model numerous other tree-like structures. Since there now exists a polynomial-time algorithm for computing geodesics in tree space due to M. Owen and S. Provan (2011), we obtain efficient algorithms for computing medians and means, which can be directly used in practice. Comment: Corrected version. Accepted in SIAM Journal on Optimization
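
The two ideas named in this abstract can be sketched in the simplest Hadamard space, ordinary Euclidean space. The code below is an illustration under that assumption, not the paper's implementation: `median_cyclic_proximal` applies one proximal step per point in turn (a split, or cyclic, proximal point scheme), and `mean_inductive` follows the law-of-large-numbers recursion by averaging randomly drawn points one at a time. All function names, the step sizes 1/k, and the iteration counts are illustrative choices. Only `dist` and `geodesic` depend on the space, so a tree-space geodesic routine could in principle be swapped in.

```python
import math
import random


def euclid_dist(x, y):
    """Distance in Euclidean space (stand-in for a Hadamard metric)."""
    return math.dist(x, y)


def euclid_geodesic(x, y, t):
    """Point at fraction t along the geodesic (here a segment) from x to y."""
    return tuple((1 - t) * a + t * b for a, b in zip(x, y))


def median_cyclic_proximal(points, cycles=2000,
                           dist=euclid_dist, geodesic=euclid_geodesic):
    """Approximate the geometric median, minimizing the sum of d(x, a_i)."""
    x = points[0]
    for k in range(1, cycles + 1):
        step = 1.0 / k  # square-summable but not summable step sizes
        for a in points:
            d = dist(x, a)
            if d > 0:
                # Proximal map of step * d(., a): move toward a by at most `step`.
                x = geodesic(x, a, min(1.0, step / d))
    return x


def mean_inductive(points, samples=20000, seed=0, geodesic=euclid_geodesic):
    """Approximate the Frechet mean by inductive averaging of random draws."""
    rng = random.Random(seed)
    s = rng.choice(points)
    for n in range(1, samples):
        # Move from the current estimate toward a new draw by fraction 1/(n+1).
        s = geodesic(s, rng.choice(points), 1.0 / (n + 1))
    return s


pts = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(median_cyclic_proximal(pts), mean_inductive(pts))
```

In Euclidean space the inductive mean is just the running average of the sampled points, so it approaches the arithmetic mean; in a general Hadamard space the same recursion uses only geodesics, which is what makes a fast geodesic algorithm for tree space relevant.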