Balanced binary trees in the Tamari lattice
We show that the set of balanced binary trees is closed under taking intervals
in the Tamari lattice. We establish that the intervals [T0, T1] where T0 and T1
are balanced trees are isomorphic as posets to a hypercube. We introduce tree
patterns and synchronous grammars to obtain a functional equation for the
generating series enumerating balanced tree intervals.
The mean, variance and limiting distribution of two statistics sensitive to phylogenetic tree balance
For two decades, the Colless index has been the most frequently used
statistic for assessing the balance of phylogenetic trees. In this article,
this statistic is studied under the Yule and uniform models of phylogenetic
trees. The main tool of analysis is a coupling argument with another well-known
index called the Sackin statistic. Asymptotics for the mean, variance and
covariance of these two statistics are obtained, as well as their limiting
joint distribution for large phylogenies. Under the Yule model, the limiting
distribution arises as a solution of a functional fixed point equation. Under
the uniform model, the limiting distribution is the Airy distribution. The
cornerstone of this study is the fact that the probabilistic models for
phylogenetic trees are strongly related to the random permutation and the
Catalan models for binary search trees.
Comment: Published at http://dx.doi.org/10.1214/105051606000000547 in the
Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute
of Mathematical Statistics (http://www.imstat.org
Dynamic load balancing in parallel KD-tree k-means
One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been explored mainly in two directions. First, the amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.
Yule-generated trees constrained by node imbalance
The Yule process generates a class of binary trees which is fundamental to
population genetic models and other applications in evolutionary biology. In
this paper, we introduce a family of sub-classes of ranked trees, called
Omega-trees, which are characterized by imbalance of internal nodes. The degree
of imbalance is defined by an integer w >= 0. For caterpillars, the extreme
case of unbalanced trees, w = 0. Under models of neutral evolution, for
instance the Yule model, trees with small w are unlikely to occur by chance.
Indeed, imbalance can be a signature of permanent selection pressure, such as
observable in the genealogies of certain pathogens. From a mathematical point
of view it is interesting to observe that the space of Omega-trees maintains
several statistical invariants although it is drastically reduced in size
compared to the space of unconstrained Yule trees. Using generating functions,
we study here some basic combinatorial properties of Omega-trees. We focus on
the distribution of the number of subtrees with two leaves. We show that
expectation and variance of this distribution match those for unconstrained
trees already for very small values of w.
Optimizing a Certified Proof Checker for a Large-Scale Computer-Generated Proof
In recent work, we formalized the theory of optimal-size sorting networks
with the goal of extracting a verified checker for the large-scale
computer-generated proof that 25 comparisons are optimal when sorting 9 inputs,
which required more than a decade of CPU time and produced 27 GB of proof
witnesses. The checker uses an untrusted oracle based on these witnesses and is
able to verify the smaller case of 8 inputs within a couple of days, but it did
not scale to the full proof for 9 inputs. In this paper, we describe several
non-trivial optimizations of the algorithm in the checker, obtained by
appropriately changing the formalization and capitalizing on the symbiosis with
an adequate implementation of the oracle. We provide experimental evidence of
orders of magnitude improvements to both runtime and memory footprint for 8
inputs, and actually manage to check the full proof for 9 inputs.
Comment: IMADA-preprint-c
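For readers unfamiliar with sorting networks, the sketch below shows what it means for a comparator network to sort. It relies on the zero-one principle: a network sorts every input if and only if it sorts every sequence of 0s and 1s. This is a brute-force illustration only and is not the certified checker or the oracle described in the abstract. The function names and the representation of comparators as index pairs are assumptions.

```python
from itertools import product


def apply_network(network, values):
    """Run a comparator network, given as (i, j) pairs with i < j, on values."""
    values = list(values)
    for i, j in network:
        if values[i] > values[j]:
            values[i], values[j] = values[j], values[i]
    return values


def is_sorting_network(network, n):
    """Check all 2**n binary inputs (zero-one principle)."""
    return all(
        apply_network(network, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )


# A well-known 5-comparator network for 4 inputs.
net4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
print(is_sorting_network(net4, 4))       # True
print(is_sorting_network(net4[:-1], 4))  # False: 0,1,0,1 is left unsorted
```

The exhaustive check grows as 2**n in the number of inputs, and that only verifies one given network. Proving that no smaller network exists is a far larger search, which is why the proof discussed in the abstract needed extensive computation.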
PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison
The selection, development, or comparison of machine learning methods in data
mining can be a difficult task based on the target problem and goals of a
particular study. Numerous publicly available real-world and simulated
benchmark datasets have emerged from different sources, but their organization
and adoption as standards have been inconsistent. As such, selecting and
curating specific benchmarks remains an unnecessary burden on machine learning
practitioners and data scientists. The present study introduces an accessible,
curated, and developing public benchmark resource to facilitate identification
of the strengths and weaknesses of different machine learning methodologies. We
compare meta-features among the current set of benchmark datasets in this
resource to characterize the diversity of available data. Finally, we apply a
number of established machine learning methods to the entire benchmark suite
and analyze how datasets and algorithms cluster in terms of performance. This
work is an important first step towards understanding the limitations of
popular benchmarking suites and developing a resource that connects existing
benchmarking standards to more diverse and efficient standards in the future.
Comment: 14 pages, 5 figures, submitted for review to JML