28,251 research outputs found
Geometry Helps to Compare Persistence Diagrams
Exploiting geometric structure to improve the asymptotic complexity of
discrete assignment problems is a well-studied subject. In contrast, the
practical advantages of using geometry for such problems have not been
explored. We implement geometric variants of the Hopcroft--Karp algorithm for
bottleneck matching (based on previous work by Efrat el al.) and of the auction
algorithm by Bertsekas for Wasserstein distance computation. Both
implementations use k-d trees to replace a linear scan with a geometric
proximity query. Our interest in this problem stems from the desire to compute
distances between persistence diagrams, a problem that comes up frequently in
topological data analysis. We show that our geometric matching algorithms lead
to a substantial performance gain, both in running time and in memory
consumption, over their purely combinatorial counterparts. Moreover, our
implementation significantly outperforms the only other implementation
available for comparing persistence diagrams.Comment: 20 pages, 10 figures; extended version of paper published in ALENEX
201
Tropical Geometry of Phylogenetic Tree Space: A Statistical Perspective
Phylogenetic trees are the fundamental mathematical representation of
evolutionary processes in biology. As data objects, they are characterized by
the challenges associated with "big data," as well as the complication that
their discrete geometric structure results in a non-Euclidean phylogenetic tree
space, which poses computational and statistical limitations. We propose and
study a novel framework to study sets of phylogenetic trees based on tropical
geometry. In particular, we focus on characterizing our framework for
statistical analyses of evolutionary biological processes represented by
phylogenetic trees. Our setting exhibits analytic, geometric, and topological
properties that are desirable for theoretical studies in probability and
statistics, as well as increased computational efficiency over the current
state-of-the-art. We demonstrate our approach on seasonal influenza data.Comment: 28 pages, 5 figures, 1 tabl
The space of ultrametric phylogenetic trees
The reliability of a phylogenetic inference method from genomic sequence data
is ensured by its statistical consistency. Bayesian inference methods produce a
sample of phylogenetic trees from the posterior distribution given sequence
data. Hence the question of statistical consistency of such methods is
equivalent to the consistency of the summary of the sample. More generally,
statistical consistency is ensured by the tree space used to analyse the
sample.
In this paper, we consider two standard parameterisations of phylogenetic
time-trees used in evolutionary models: inter-coalescent interval lengths and
absolute times of divergence events. For each of these parameterisations we
introduce a natural metric space on ultrametric phylogenetic trees. We compare
the introduced spaces with existing models of tree space and formulate several
formal requirements that a metric space on phylogenetic trees must possess in
order to be a satisfactory space for statistical analysis, and justify them. We
show that only a few known constructions of the space of phylogenetic trees
satisfy these requirements. However, our results suggest that these basic
requirements are not enough to distinguish between the two metric spaces we
introduce and that the choice between metric spaces requires additional
properties to be considered. Particularly, that the summary tree minimising the
square distance to the trees from the sample might be different for different
parameterisations. This suggests that further fundamental insight is needed
into the problem of statistical consistency of phylogenetic inference methods.Comment: Minor changes. This version has been published in JTB. 27 pages, 9
figure
Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening
This work introduces a number of algebraic topology approaches, such as
multicomponent persistent homology, multi-level persistent homology and
electrostatic persistence for the representation, characterization, and
description of small molecules and biomolecular complexes. Multicomponent
persistent homology retains critical chemical and biological information during
the topological simplification of biomolecular geometric complexity.
Multi-level persistent homology enables a tailored topological description of
inter- and/or intra-molecular interactions of interest. Electrostatic
persistence incorporates partial charge information into topological
invariants. These topological methods are paired with Wasserstein distance to
characterize similarities between molecules and are further integrated with a
variety of machine learning algorithms, including k-nearest neighbors, ensemble
of trees, and deep convolutional neural networks, to manifest their descriptive
and predictive powers for chemical and biological problems. Extensive numerical
experiments involving more than 4,000 protein-ligand complexes from the PDBBind
database and near 100,000 ligands and decoys in the DUD database are performed
to test respectively the scoring power and the virtual screening power of the
proposed topological approaches. It is demonstrated that the present approaches
outperform the modern machine learning based methods in protein-ligand binding
affinity predictions and ligand-decoy discrimination
- …