180 research outputs found

    Tropical Principal Component Analysis and its Application to Phylogenetics

    Principal component analysis is a widely used method for the dimensionality reduction of a given data set in a high-dimensional Euclidean space. Here we define and analyze two analogues of principal component analysis in the setting of tropical geometry. In one approach, we study the Stiefel tropical linear space of fixed dimension closest to the data points in the tropical projective torus; in the other approach, we consider the tropical polytope with a fixed number of vertices closest to the data points. We then give approximation algorithms for both approaches and apply them to phylogenetics, testing the methods on simulated phylogenetic data and on an empirical dataset of Apicomplexa genomes. Comment: 28 pages
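Both approaches in the abstract above depend on a notion of "closest" in the tropical projective torus. The abstract does not say which distance is used; the sketch below assumes the standard tropical metric, d(u, v) = max_i(u_i - v_i) - min_i(u_i - v_i). The function name `tropical_distance` is hypothetical and is not taken from the paper.

```python
from typing import Sequence


def tropical_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Tropical metric on the tropical projective torus (assumed here).

    Points are equivalence classes of vectors modulo adding the same
    constant to every coordinate, so the result must not change when
    u or v is shifted by a multiple of (1, ..., 1).
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    if not u:
        raise ValueError("vectors must be non-empty")
    diffs = [a - b for a, b in zip(u, v)]
    # Spread of the coordinate-wise differences.
    return max(diffs) - min(diffs)
```

Because only the spread of the differences matters, `tropical_distance([0, 1, 3], [0, 0, 0])` and `tropical_distance([5, 6, 8], [0, 0, 0])` give the same value, which is the shift invariance the projective torus requires.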

    Testing metric properties

    Finite metric spaces, and in particular tree metrics, play an important role in various disciplines such as evolutionary biology and statistics. A natural family of problems concerning metrics is deciding, given a matrix M, whether or not it is a distance metric of a certain predetermined type. Here we consider the following relaxed version of such decision problems: for any given matrix M and parameter ϵ, we are interested in determining, by probing M, whether M has a particular metric property P, or whether it is ϵ-far from having the property. By ϵ-far we mean that at least an ϵ-fraction of the entries of M must be modified so that it obtains the property. The algorithm may query the matrix on entries M[i,j] of its choice, and is allowed a constant probability of error. We describe algorithms for testing Euclidean metrics, tree metrics and ultrametrics. Furthermore, we present an algorithm that tests whether a matrix M is an approximate ultrametric. In all cases the query complexity and running time are polynomial in 1/ϵ and independent of the size of the matrix. Finally, our algorithms can be used to solve relaxed versions of the corresponding search problems in time that is sub-linear in the size of the matrix.
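To make the probing idea concrete, here is a naive sampling sketch for the ultrametric case. It is not the algorithm from the paper, whose sample sizes and guarantees are not given in the abstract. It only uses the standard three-point condition: among the three distances of any triple, the two largest must be equal. The names `triple_is_ultrametric` and `sample_ultrametric_test` and the parameter `num_samples` are hypothetical.

```python
import math
import random
from typing import List, Optional

Matrix = List[List[float]]


def triple_is_ultrametric(M: Matrix, i: int, j: int, k: int) -> bool:
    """Three-point condition: the two largest of the three distances agree."""
    _, middle, largest = sorted((M[i][j], M[i][k], M[j][k]))
    return math.isclose(middle, largest)


def sample_ultrametric_test(M: Matrix, num_samples: int,
                            seed: Optional[int] = None) -> bool:
    """Accept unless a randomly probed triple violates the condition.

    A true ultrametric is always accepted. How many samples are needed to
    reject an epsilon-far matrix with constant probability is NOT derived
    here; num_samples is left to the caller.
    """
    n = len(M)
    if n < 3:
        return True
    rng = random.Random(seed)
    for _ in range(num_samples):
        i, j, k = rng.sample(range(n), 3)  # three distinct indices
        if not triple_is_ultrametric(M, i, j, k):
            return False
    return True
```

Each probe reads only three entries of M, so the number of queries depends on `num_samples` and not on the size of the matrix, which mirrors the query model described in the abstract.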

    Computational Molecular Biology

    Computational Biology is a fairly new subject that arose in response to the computational problems posed by the analysis and the processing of biomolecular sequence and structure data. The field was initiated in the late 60's and early 70's largely by pioneers working in the life sciences. Physicists and mathematicians entered the field in the 70's and 80's, while Computer Science became involved with the new biological problems in the late 1980's. Computational problems have gained further importance in molecular biology through the various genome projects which produce enormous amounts of data. For this bibliography we focus on those areas of computational molecular biology that involve discrete algorithms or discrete optimization. We thus neglect several other areas of computational molecular biology, like most of the literature on the protein folding problem, as well as databases for molecular and genetic data, and genetic mapping algorithms. Due to the availability of review papers and a bibliography this bibliography

    Registering the evolutionary history in individual-based models of speciation

    Understanding the emergence of biodiversity patterns in nature is a central problem in biology. Theoretical models of speciation have addressed this question at the macroecological scale, but little has been done to connect microevolutionary processes with macroevolutionary patterns. Knowledge of the evolutionary history allows the study of patterns underlying the processes being modeled, revealing their signatures and the role of speciation and extinction in shaping macroevolutionary patterns. In this paper we introduce two algorithms to record the evolutionary history of populations and species in individual-based models of speciation, from which genealogies and phylogenies can be constructed. The first algorithm relies on saving ancestor–descendant relationships, generating a matrix that contains the times to the most recent common ancestor between all pairs of individuals at every generation (the Most Recent Common Ancestor Time matrix, MRCAT). The second algorithm directly records all speciation and extinction events throughout the evolutionary process, generating a matrix with the true phylogeny of species (the Sequential Speciation and Extinction Events, SSEE). We illustrate the use of these algorithms in a spatially explicit individual-based model of speciation. We compare the trees generated via the MRCAT and SSEE algorithms with trees inferred by methods that use only the genetic distance between individuals of extant species, which are commonly used in empirical studies and are applied here to simulated genetic data. Comparisons between trees are performed with metrics describing the overall topology, branch length distribution and degree of imbalance. We observe that both MRCAT and distance-based trees differ from the true phylogeny, with the former being closer to the true tree than the latter. Facultad de Ciencias Naturales y Muse
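The MRCAT bookkeeping described above can be illustrated with a one-generation update. The abstract does not give the update rule, so this sketch assumes the simplest setting: non-overlapping generations and a single parent per individual (asexual reproduction). Under that assumption, two distinct offspring are one generation further from their common ancestor than their parents were. The function name `update_mrcat` is hypothetical.

```python
from typing import List


def update_mrcat(prev: List[List[int]], parents: List[int]) -> List[List[int]]:
    """Build the next generation's MRCA-time matrix.

    prev[a][b] is the time (in generations) to the most recent common
    ancestor of individuals a and b in the parent generation, with
    prev[a][a] == 0. parents[i] is the index of offspring i's single
    parent (single-parent reproduction is an assumption of this sketch).
    """
    n = len(parents)
    return [
        [0 if i == j else prev[parents[i]][parents[j]] + 1 for j in range(n)]
        for i in range(n)
    ]
```

Siblings share a parent, so `prev[p][p] + 1` gives them an MRCA time of 1, as expected. A model with two parents per individual would need a rule for combining the parental entries, which is not attempted here.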

    The Haar Wavelet Transform of a Dendrogram: Additional Notes

    We consider the wavelet transform of a finite, rooted, node-ranked, p-way tree, focusing on the case of binary (p = 2) trees. We study a Haar wavelet transform on this tree. Wavelet transforms allow for multiresolution analysis through translation and dilation of a wavelet function. We explore how this works in our tree context. Comment: 37 pp, 1 fig. Supplementary material to "The Haar Wavelet Transform of a Dendrogram", http://arxiv.org/abs/cs.IR/060810

    Unsupervised representation learning with Minimax distance measures

    We investigate the use of Minimax distances to extract, in a nonparametric way, the features that capture the unknown underlying patterns and structures in the data. We develop a general-purpose and computationally efficient framework to employ Minimax distances with many machine learning methods that operate on numerical data. We study both computing the pairwise Minimax distances for all pairs of objects and computing the Minimax distances of all the objects to/from a fixed (test) object. We first efficiently compute the pairwise Minimax distances between the objects, using the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on it. Then, we embed the pairwise Minimax distances into a new vector space, such that the squared Euclidean distances in the new space equal the pairwise Minimax distances in the original space. We also study the case of having multiple pairwise Minimax matrices instead of a single one. For this case, we propose an embedding that first sums the centered matrices and then performs an eigenvalue decomposition to obtain the relevant features. Next, we study computing Minimax distances from a fixed (test) object, which can be used, for instance, in K-nearest neighbor search. As in the all-pairs case, we develop an efficient and general-purpose algorithm that is applicable with an arbitrary base distance measure. Moreover, we investigate in detail the edges selected by the Minimax distances and thereby explore the ability of Minimax distances to detect outlier objects. Finally, for each setting, we perform several experiments to demonstrate the effectiveness of our framework.
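The equivalence between Minimax distances on a graph and on its minimum spanning tree suggests a Kruskal-style computation: process edges in increasing order of weight, and whenever an edge joins two components, its weight is the Minimax distance between every pair of objects across those components. The sketch below follows that idea for a dense symmetric base distance matrix. It is an illustration under this assumption and is not claimed to be the paper's algorithm; the function name `pairwise_minimax` is hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_minimax(D: List[List[float]]) -> List[List[float]]:
    """All-pairs Minimax distances for a symmetric base distance matrix D.

    The Minimax distance between a and b is the smallest possible value of
    the largest edge weight over all paths from a to b.
    """
    n = len(D)
    edges = sorted((D[i][j], i, j) for i, j in combinations(range(n), 2))
    comp = list(range(n))                           # component label per object
    members: Dict[int, List[int]] = {i: [i] for i in range(n)}
    mm = [[0.0] * n for _ in range(n)]
    for w, i, j in edges:
        ci, cj = comp[i], comp[j]
        if ci == cj:
            continue                                # not a spanning-tree edge
        # This edge is the bottleneck for every pair across the two components.
        for a in members[ci]:
            for b in members[cj]:
                mm[a][b] = mm[b][a] = w
        if len(members[ci]) < len(members[cj]):
            ci, cj = cj, ci                         # merge smaller into larger
        for b in members[cj]:
            comp[b] = ci
        members[ci].extend(members.pop(cj))
    return mm
```

For four points on a line at positions 0, 1, 2 and 10 with absolute difference as the base distance, the first three points are mutually at Minimax distance 1, while each of them is at Minimax distance 8 from the last point, the size of the single large gap on the connecting path.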

    Tropical Geometry of Phylogenetic Tree Space: A Statistical Perspective

    Phylogenetic trees are the fundamental mathematical representation of evolutionary processes in biology. As data objects, they are characterized by the challenges associated with "big data," as well as the complication that their discrete geometric structure results in a non-Euclidean phylogenetic tree space, which poses computational and statistical limitations. We propose and study a novel framework for analyzing sets of phylogenetic trees based on tropical geometry. In particular, we focus on characterizing our framework for statistical analyses of evolutionary biological processes represented by phylogenetic trees. Our setting exhibits analytic, geometric, and topological properties that are desirable for theoretical studies in probability and statistics, as well as increased computational efficiency over the current state of the art. We demonstrate our approach on seasonal influenza data. Comment: 28 pages, 5 figures, 1 table