In Search of Optimal Linkage Trees
Linkage-learning Evolutionary Algorithms (EAs) use linkage
learning to construct a linkage model, which is exploited
to solve problems efficiently by taking into account important
linkages, i.e. dependencies between problem variables,
during variation. It has been shown that when this linkage
model is aligned correctly with the structure of the problem,
these EAs are capable of solving problems efficiently by
performing variation based on this linkage model [2]. The
Linkage Tree Genetic Algorithm (LTGA) uses a Linkage Tree
(LT) as a linkage model to identify the problem's structure
hierarchically, enabling it to solve various problems very
efficiently. Understanding the reasons for LTGA's excellent
performance is highly valuable as LTGA is also able to
efficiently solve problems for which a tree-like linkage model
seems inappropriate. This brings us to ask what in fact
makes a linkage model ideal for LTGA to be used
A rapid and scalable method for multilocus species delimitation using Bayesian model comparison and rooted triplets
Multilocus sequence data provide far greater power to resolve species limits than the single locus data typically used for broad surveys of clades. However, current statistical methods based on a multispecies coalescent framework are computationally demanding, because of the number of possible delimitations that must be compared and time-consuming likelihood calculations. New methods are therefore needed to open up the power of multilocus approaches to larger systematic surveys. Here, we present a rapid and scalable method that introduces two innovations. First, the method reduces the complexity of likelihood calculations by decomposing the tree into rooted triplets. The distribution of topologies for a triplet across multiple loci has a uniform trinomial distribution when the 3 individuals belong to the same species, but a skewed distribution if they belong to separate species, with a form that is specified by the multispecies coalescent. A Bayesian model comparison framework was developed, and the best delimitation is found by comparing the product of posterior probabilities of all triplets. The second innovation is a new dynamic programming algorithm for finding the optimum delimitation from all those compatible with a guide tree by successively analyzing subtrees defined by each node. This algorithm removes the need for the heuristic searches used by current methods, guarantees that the best solution is found, and could potentially be used in other systematic applications. We assessed the performance of the method with simulated, published and newly generated data. Analyses of simulated data demonstrate that the combined method has favourable statistical properties and scalability with increasing sample sizes. Analyses of empirical data from both eukaryotes and prokaryotes demonstrate its potential for delimiting species in real cases
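
The guide-tree search described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the nested-tuple tree encoding, the function names and the `lump_score` callback are hypothetical, and the sketch assumes, purely for illustration, a score that simply adds up across the proposed species. At each node it keeps the better of two options: lump every leaf below the node into one species, or join the best delimitations of the child subtrees.

```python
from typing import Callable, FrozenSet, List, Tuple, Union

# Hypothetical encoding: a leaf is a name, an internal node is a tuple of subtrees.
Tree = Union[str, tuple]
Score = Callable[[FrozenSet[str]], float]


def leaves(node: Tree) -> List[str]:
    """Return the leaf names below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def best_delimitation(node: Tree, lump_score: Score) -> Tuple[float, List[List[str]]]:
    """Best (score, groups) among delimitations compatible with the guide tree.

    lump_score is an assumed, user-supplied function that scores treating a
    set of individuals as a single species; scores are assumed to be additive.
    """
    members = leaves(node)
    lumped = (lump_score(frozenset(members)), [sorted(members)])
    if isinstance(node, str):
        return lumped
    # Alternative: keep the children separate and reuse their best solutions.
    split_score = 0.0
    split_groups: List[List[str]] = []
    for child in node:
        child_score, child_groups = best_delimitation(child, lump_score)
        split_score += child_score
        split_groups.extend(child_groups)
    if lumped[0] >= split_score:
        return lumped
    return split_score, split_groups
```

Because every subtree is solved once and its result reused by its parent, the search visits each node a single time instead of enumerating all compatible delimitations.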
Nonparametric Feature Extraction from Dendrograms
We propose feature extraction from dendrograms in a nonparametric way. The
Minimax distance measures correspond to building a dendrogram with the
single-linkage criterion and defining specific forms of a level function and a
distance function over it. We therefore extend this method to arbitrary
dendrograms. We develop a generalized framework wherein different distance
measures can be inferred from different types of dendrograms, level functions
and distance functions. Via an appropriate embedding, we compute a vector-based
representation of the inferred distances, in order to enable many numerical
machine learning algorithms to employ such distances. Then, to address the
model selection problem, we study the aggregation of different dendrogram-based
distances respectively in solution space and in representation space in the
spirit of deep representations. In the first approach, for example for the
clustering problem, we build a graph with positive and negative edge weights
according to the consistency of the clustering labels of different objects
among different solutions, in the context of ensemble methods. Then, we use an
efficient variant of correlation clustering to produce the final clusters. In
the second approach, we investigate the sequential combination of different
distances and features in the spirit of multi-layered
architectures to obtain the final features. Finally, we demonstrate the
effectiveness of our approach via several numerical studies
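
The link between Minimax distances and single linkage mentioned above can be made concrete with a short Python sketch. It assumes a symmetric matrix of pairwise distances and uses a hypothetical function name; it is not the paper's framework, only the special case the abstract starts from. Merging clusters in order of increasing edge weight (as single linkage does), the weight at which two points first fall into the same cluster is their Minimax distance, i.e. the smallest possible value of the largest edge on any path between them.

```python
from itertools import combinations
from typing import List, Sequence


def minimax_distances(dist: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise Minimax distances via single-linkage merge heights.

    dist is assumed to be a symmetric matrix of non-negative distances.
    """
    n = len(dist)
    root = list(range(n))                   # current cluster label of each point
    members = {i: [i] for i in range(n)}    # cluster label -> points in it
    result = [[0.0] * n for _ in range(n)]
    # Process edges from shortest to longest, as single linkage does.
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    for weight, i, j in edges:
        ri, rj = root[i], root[j]
        if ri == rj:
            continue                        # already in the same cluster
        # Every pair that becomes connected now has this merge height.
        for a in members[ri]:
            for b in members[rj]:
                result[a][b] = result[b][a] = weight
        for b in members[rj]:
            root[b] = ri
        members[ri].extend(members.pop(rj))
    return result
```

For three points on a line at positions 0, 1 and 3, the Minimax distance between the two end points is 2 (the longest hop on the path through the middle point) rather than the direct distance 3.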
Scalability of Genetic Programming and Probabilistic Incremental Program Evolution
This paper discusses scalability of standard genetic programming (GP) and the
probabilistic incremental program evolution (PIPE). To investigate the need for
both effective mixing and linkage learning, two test problems are considered:
the ORDER problem, which is rather easy for any recombination-based GP, and TRAP, or
the deceptive trap problem, which requires the algorithm to learn interactions
among subsets of terminals. The scalability results show that both GP and PIPE
scale up polynomially with problem size on the simple ORDER problem, but they
both scale up exponentially on the deceptive problem. This indicates that while
standard recombination is sufficient when no interactions need to be
considered, for some problems linkage learning is necessary. These results are
in agreement with the lessons learned in the domain of binary-string genetic
algorithms (GAs). Furthermore, the paper investigates the effects of
introducing unnecessary and irrelevant primitives on the performance of GP and
PIPE. Comment: Submitted to GECCO-200
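
To illustrate why a deceptive problem demands linkage learning, here is a minimal Python sketch of the concatenated trap function from the binary-string GA literature that the abstract refers to. It is the bit-string analogue, not the tree-based TRAP problem used for GP and PIPE in the paper; the function name and the default block size k = 5 are assumptions made for the example.

```python
from typing import Sequence


def trap_fitness(bits: Sequence[int], k: int = 5) -> int:
    """Sum of deceptive trap scores over consecutive blocks of k bits.

    Within a block holding u ones, the score is k when u == k and
    k - 1 - u otherwise, so every bit flipped towards the optimum lowers
    the score until the whole block is set at once.
    """
    if k < 1 or len(bits) % k != 0:
        raise ValueError("length of bits must be a positive multiple of k")
    total = 0
    for start in range(0, len(bits), k):
        ones = sum(bits[start:start + k])
        total += k if ones == k else k - 1 - ones
    return total
```

A block of all zeros scores k - 1 and a block with a single zero scores 0, so an algorithm that treats bits independently is drawn away from the all-ones optimum; the bits of a block have to be handled as a unit.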
Anytime Hierarchical Clustering
We propose a new anytime hierarchical clustering method that iteratively
transforms an arbitrary initial hierarchy on the configuration of measurements
along a sequence of trees that, as we prove, for a fixed data set must terminate
in a chain of nested partitions that satisfies a natural homogeneity requirement.
Each recursive step re-edits the tree so as to improve a local measure of
cluster homogeneity that is compatible with a number of commonly used (e.g.,
single, average, complete) linkage functions. As an alternative to the standard
batch algorithms, we present numerical evidence to suggest that appropriate
adaptations of this method can yield decentralized, scalable algorithms
suitable for distributed/parallel computation of clustering hierarchies and
online tracking of clustering trees applicable to large, dynamically changing
databases and anomaly detection. Comment: 13 pages, 6 figures, 5 tables, in preparation for submission to a
conferenc
A cost function for similarity-based hierarchical clustering
The development of algorithms for hierarchical clustering has been hampered
by a shortage of precise objective functions. To help address this situation,
we introduce a simple cost function on hierarchies over a set of points, given
pairwise similarities between those points. We show that this criterion behaves
sensibly in canonical instances and that it admits a top-down construction
procedure with a provably good approximation ratio
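
The abstract does not spell the cost function out. As an illustration, the Python sketch below evaluates one cost of this kind, assumed here rather than quoted from the paper: each pair of points is charged its similarity multiplied by the number of leaves under the node at which the pair is first separated. The nested-tuple tree encoding and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical encoding: a leaf is an integer index, an internal node a tuple.
Tree = Union[int, tuple]


def hierarchy_cost(tree: Tree, similarity: Sequence[Sequence[float]]) -> float:
    """Cost of a hierarchy under the assumed pairwise-similarity criterion."""

    def walk(node: Tree) -> Tuple[List[int], float]:
        if not isinstance(node, tuple):
            return [node], 0.0
        parts = [walk(child) for child in node]
        below = [leaf for part_leaves, _ in parts for leaf in part_leaves]
        cost = sum(part_cost for _, part_cost in parts)
        # Pairs split between two different children are separated here
        # and pay similarity times the size of this subtree.
        for a in range(len(parts)):
            for b in range(a + 1, len(parts)):
                for i in parts[a][0]:
                    for j in parts[b][0]:
                        cost += similarity[i][j] * len(below)
        return below, cost

    return walk(tree)[1]
```

Under this criterion, separating similar points high in the tree is expensive, so a tree that keeps a highly similar pair together in a small subtree costs less than one that splits the pair at the root.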