Polyhedral geometry of Phylogenetic Rogue Taxa
It is well known among phylogeneticists that adding an extra taxon (e.g.
species) to a data set can alter the structure of the optimal phylogenetic tree
in surprising ways. However, little is known about this "rogue taxon" effect.
In this paper we characterize the behavior of balanced minimum evolution (BME)
phylogenetics on data sets of this type using tools from polyhedral geometry.
First we show that for any distance matrix there exist distances to a "rogue
taxon" such that the BME-optimal tree for the data set with the new taxon does
not contain any nontrivial splits (bipartitions) of the optimal tree for the
original data. Second, we prove a theorem which restricts the topology of
BME-optimal trees for data sets of this type, thus showing that a rogue taxon
cannot have an arbitrary effect on the optimal tree. Third, we construct
polyhedral cones computationally which give complete answers for BME rogue
taxon behavior when our original data fits a tree on four, five, and six taxa.
We use these cones to derive sufficient conditions for rogue taxon behavior for
four taxa, and to understand the frequency of the rogue taxon effect via
simulation. Comment: In this version, we add quartet distances and fix Table 4.
RRB-Trees: Efficient Immutable Vectors
Immutable vectors are a convenient data structure for functional programming and part of the standard library of modern languages like Clojure and Scala. The common implementation is based on wide trees with a fixed number of children per node, which allows fast indexed lookup and update operations. In this paper we extend the vector data type with a new underlying data structure, Relaxed Radix Balanced Trees (RRB-Trees), and show how this structure allows immutable vector concatenation, insert-at and splits in O(log N) time while maintaining the index, update and iteration speeds of the original vector data structure.
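The "wide trees with a fixed number of children per node" mentioned above can be sketched as follows. This is a minimal illustration of the plain radix-balanced vector (not the relaxed RRB variant the paper introduces), assuming a branching factor of 4 for readability; the names `build`, `lookup` and `update` are hypothetical.

```python
BITS = 2              # assumption: 2**2 = 4 children per node, chosen for readability
WIDTH = 1 << BITS
MASK = WIDTH - 1

def build(items):
    """Build a left-packed radix-balanced tree; return (root, shift)."""
    level = [list(items[i:i + WIDTH]) for i in range(0, len(items), WIDTH)] or [[]]
    shift = 0
    while len(level) > 1:
        # Group WIDTH nodes of the current level under one parent node.
        level = [level[i:i + WIDTH] for i in range(0, len(level), WIDTH)]
        shift += BITS
    return level[0], shift

def lookup(root, shift, index):
    """Follow BITS-wide digits of the index from the root down to a leaf."""
    node = root
    while shift > 0:
        node = node[(index >> shift) & MASK]
        shift -= BITS
    return node[index & MASK]

def update(root, shift, index, value):
    """Return a new root; only the nodes on the root-to-leaf path are copied."""
    new_node = list(root)
    if shift == 0:
        new_node[index & MASK] = value
    else:
        i = (index >> shift) & MASK
        new_node[i] = update(root[i], shift - BITS, index, value)
    return new_node
```

Because `update` copies only one node per level, the old and new versions share all other subtrees, which is what keeps the structure immutable without copying the whole vector.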
On the cost of fixed partial match queries in K-d trees
The final publication is available at Springer via http://dx.doi.org/10.1007/s00453-015-0097-4
Partial match queries constitute the most basic type of associative queries in multidimensional data structures such as K-d trees or quadtrees. Given a query $q=(q_0,\ldots,q_{K-1})$ where $s$ of the coordinates are specified and $K-s$ are left unspecified ($q_i=*$), a partial match search returns the subset of data points $x=(x_0,\ldots,x_{K-1})$ in the data structure that match the given query, that is, the data points such that $x_i=q_i$ whenever $q_i\neq *$. There exists a wealth of results about the cost of partial match searches in many different multidimensional data structures, but most of these results deal with random queries. Only recently a few papers have begun to investigate the cost of partial match queries with a fixed query $q$. This paper represents a new contribution in this direction, giving a detailed asymptotic estimate of the expected cost $P_{n,q}$ for a given fixed query $q$. From previous results on the cost of partial matches with a fixed query and the ones presented here, a deeper understanding is emerging, uncovering the following functional shape for $P_{n,q}$:
$$P_{n,q}=\nu\cdot\Bigl(\prod_{i:\,q_i\text{ is specified}} q_i(1-q_i)\Bigr)^{\alpha/2}\cdot n^{\alpha}+\text{l.o.t.}$$
(l.o.t. lower order terms, throughout this work) in many multidimensional data structures, which differ only in the exponent $\alpha$ and the constant $\nu$, both dependent on $s$ and $K$, and, for some data structures, on the whole pattern of specified and unspecified coordinates in $q$ as well. Although it is tempting to conjecture that this functional shape is “universal”, we have shown experimentally that it seems not to be true for a variant of K-d trees called squarish K-d trees. Peer Reviewed. Postprint (author's final draft)
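The partial match search described above can be sketched as follows. This is a minimal illustration on a standard K-d tree with cyclically assigned discriminants, not the paper's analysis; `None` stands for an unspecified coordinate (*), and the names `Node`, `insert` and `partial_match` are hypothetical. The number of visited nodes is returned as the cost of the query.

```python
class Node:
    def __init__(self, point, axis):
        self.point = point
        self.axis = axis          # coordinate used to discriminate at this node
        self.left = None
        self.right = None

def insert(root, point, depth=0):
    """Insert a point; the discriminant cycles through the K coordinates."""
    if root is None:
        return Node(point, depth % len(point))
    if point[root.axis] < root.point[root.axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root

def partial_match(root, query):
    """Return (matching points, number of visited nodes)."""
    if root is None:
        return [], 0
    is_match = all(q is None or q == x for q, x in zip(query, root.point))
    matches = [root.point] if is_match else []
    visited = 1
    q = query[root.axis]
    if q is None:
        children = [root.left, root.right]   # unspecified: explore both subtrees
    elif q < root.point[root.axis]:
        children = [root.left]               # specified: follow one subtree only
    else:
        children = [root.right]
    for child in children:
        sub_matches, sub_visited = partial_match(child, query)
        matches += sub_matches
        visited += sub_visited
    return matches, visited
```

For example, with two-dimensional points the query `(None, 3)` asks for every stored point whose second coordinate equals 3; the search branches into both subtrees only at nodes that discriminate on the unspecified first coordinate.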
RE-EM Trees: A New Data Mining Approach for Longitudinal Data
Longitudinal data refer to the situation where repeated observations are
available for each sampled individual. Methodologies that take this
structure into account allow for systematic differences between
individuals that are not related to covariates. A standard methodology
in the statistics literature for this type of data is the random effects
model, where these differences between individuals are represented by
so-called “effects” that are estimated from the data. This
paper presents a methodology that combines the flexibility of tree-based
estimation methods with the structure of random effects models for
longitudinal data. We apply the resulting estimation method, called the
RE-EM tree, to pricing in online transactions, showing that the RE-EM
tree is less sensitive to parametric assumptions and provides improved
predictive power compared to linear models with random effects and
regression trees without random effects. We also perform extensive
simulation experiments to show that the estimator improves predictive
performance relative to regression trees without random effects and is
comparable or superior to using linear models with random effects in
more general situations. Statistics Group, Information, Operations, and Management Science
Department, Stern School of Business, New York University. Statistics Working Papers Series
Smooth heaps and a dual view of self-adjusting data structures
We present a new connection between self-adjusting binary search trees (BSTs)
and heaps, two fundamental, extensively studied, and practically relevant
families of data structures. Roughly speaking, we map an arbitrary heap
algorithm within a natural model, to a corresponding BST algorithm with the
same cost on a dual sequence of operations (i.e. the same sequence with the
roles of time and key-space switched). This is the first general transformation
between the two families of data structures.
There is a rich theory of dynamic optimality for BSTs (i.e. the theory of
competitiveness between BST algorithms). The lack of an analogous theory for
heaps has been noted in the literature. Through our connection, we transfer all
instance-specific lower bounds known for BSTs to a general model of heaps,
initiating a theory of dynamic optimality for heaps.
On the algorithmic side, we obtain a new, simple and efficient heap
algorithm, which we call the smooth heap. We show the smooth heap to be the
heap-counterpart of Greedy, the BST algorithm with the strongest proven and
conjectured properties from the literature, widely believed to be
instance-optimal. Assuming the optimality of Greedy, the smooth heap is also
optimal within our model of heap algorithms. As corollaries of results known
for Greedy, we obtain instance-specific upper bounds for the smooth heap, with
applications in adaptive sorting.
Intriguingly, the smooth heap, although derived from a non-practical BST
algorithm, is simple and easy to implement (e.g. it stores no auxiliary data
besides the keys and tree pointers). It can be seen as a variation on the
popular pairing heap data structure, extending it with a "power-of-two-choices"
type of heuristic. Comment: Presented at STOC 2018, light revision, additional figures
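For readers unfamiliar with the pairing heap that the abstract takes as its point of comparison, a minimal two-pass sketch follows. It illustrates the classical pairing heap only, not the smooth heap or its "power-of-two-choices" rule; the representation (a node as a `[key, children]` list) and all function names are assumptions made for this illustration.

```python
def meld(a, b):
    """Link two heaps: the root with the larger key becomes a child of the other."""
    if a is None:
        return b
    if b is None:
        return a
    if a[0] <= b[0]:
        a[1].append(b)
        return a
    b[1].append(a)
    return b

def insert(heap, key):
    """Insert a key by melding a one-node heap into the existing heap."""
    return meld(heap, [key, []])

def find_min(heap):
    return heap[0]

def delete_min(heap):
    """Remove the root, then recombine its children in two passes."""
    kids = heap[1]
    # Pass 1: link the children in adjacent pairs.
    paired = [
        meld(kids[i], kids[i + 1] if i + 1 < len(kids) else None)
        for i in range(0, len(kids), 2)
    ]
    # Pass 2: accumulate the paired heaps from right to left.
    result = None
    for h in reversed(paired):
        result = meld(h, result)
    return result

def heap_sort(keys):
    """Sort by repeated insert and delete-min (for demonstration only)."""
    heap = None
    for k in keys:
        heap = insert(heap, k)
    out = []
    while heap is not None:
        out.append(find_min(heap))
        heap = delete_min(heap)
    return out
```

As in the abstract's description of the smooth heap, this structure stores nothing besides the keys and the child pointers.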
Foundations of the Wald Space for Phylogenetic Trees
Evolutionary relationships between species are represented by phylogenetic
trees, but these relationships are subject to uncertainty due to the random
nature of evolution. A geometry for the space of phylogenetic trees is
necessary in order to properly quantify this uncertainty during the statistical
analysis of collections of possible evolutionary trees inferred from biological
data. Recently, the wald space has been introduced: a length space for trees
which is a certain subset of the manifold of symmetric positive definite
matrices. In this work, the wald space is introduced formally and its topology
and structure are studied in detail. In particular, we show that wald space has
the topology of a disjoint union of open cubes, it is contractible, and by
careful characterization of cube boundaries, we demonstrate that wald space is
a Whitney stratified space of type (A). Imposing the metric induced by the
affine invariant metric on symmetric positive definite matrices, we prove that
wald space is a geodesic Riemann stratified space. A new numerical method is
proposed and investigated for construction of geodesics, computation of
Fréchet means and calculation of curvature in wald space. This work is
intended to serve as a mathematical foundation for further geometric and
statistical research on this space. Comment: 42 pages, 15 figures
STATISTICS ON MULTITYPE GALTON-WATSON TREES
This work proposes a statistical study of multitype Galton-Watson trees in order to obtain data on their offspring distribution. The investigation is motivated by some simplified parametric models, based on particular two-type Galton-Watson trees, that we propose for the biological process called angiogenesis, i.e. the growth of new blood vessels. The basic idea of the models is to simplify the structure of a blood vessel as a union of its head and the body of the vessel itself. Moreover, the body of the vessel is conceived as a union of essential units, all of the same size. We then apply the structure of certain two-type Galton-Watson trees to the growth of a blood vessel, where the two types of particles are the heads and the essential units of a blood vessel, respectively.
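A generic two-type Galton-Watson process of the kind referred to above can be simulated with a short sketch. The offspring laws used here are hypothetical and chosen only for illustration; they are not the models proposed in the work. Type 0 plays the role of a "head" and type 1 of an "essential unit", and the function name `simulate` is an assumption.

```python
import random

def simulate(offspring, start, generations, rng):
    """Simulate a two-type Galton-Watson process.

    offspring[t](rng) returns (children of type 0, children of type 1)
    for one particle of type t. Returns the list of per-generation counts.
    """
    counts = tuple(start)
    history = [counts]
    for _generation in range(generations):
        nxt = [0, 0]
        for ptype in (0, 1):
            for _particle in range(counts[ptype]):
                kids = offspring[ptype](rng)
                nxt[0] += kids[0]
                nxt[1] += kids[1]
        counts = tuple(nxt)
        history.append(counts)
    return history

# Hypothetical offspring laws (assumptions, not taken from the source):
# a head leaves one unit behind and branches into two heads with probability 0.3;
# a unit has no offspring.
def head(rng):
    return (2, 1) if rng.random() < 0.3 else (1, 1)

def unit(rng):
    return (0, 0)

history = simulate({0: head, 1: unit}, start=(1, 0), generations=5,
                   rng=random.Random(0))
```

Repeating such simulations and recording the generation counts is one way to produce data from which an offspring distribution could be estimated.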
From trees to networks and back
The evolutionary history of a set of species is commonly represented by a phylogenetic tree. Often, however, the data contain conflicting signals, which can be better represented by a more general structure, namely a phylogenetic network. Such networks allow the display of several alternative evolutionary scenarios simultaneously but this can come at the price of complex visual representations. Using so-called circular split networks reduces this complexity, because this type of network can always be visualized in the plane without any crossing edges. These circular split networks form the core of this thesis. We construct them, use them as a search space for minimum evolution trees and explore their properties.
More specifically, we present a new method, called SuperQ, to construct a circular split network summarising a collection of phylogenetic trees that have overlapping leaf sets. Then, we explore the set of phylogenetic trees associated with a fixed circular split network, in particular using it as a search space for optimal trees. This set represents just a tiny fraction of the space of all phylogenetic trees, but we still find trees within it that compare quite favourably with those obtained by a leading heuristic, which uses tree edit operations for searching the whole tree space. In the last part, we advance our understanding of the set of phylogenetic trees associated with a circular split network. Specifically, we investigate the size of the so-called circular tree neighbourhood for the three tree edit operations, tree bisection and reconnection (tbr), subtree prune and regraft (spr) and nearest neighbour interchange (nni).