10,368 research outputs found
An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings
A widely used method for determining the similarity of two labeled trees is
to compute a maximum agreement subtree of the two trees. Previous work on this
similarity measure is only concerned with the comparison of labeled trees of
two special kinds, namely, uniformly labeled trees (i.e., trees with all their
nodes labeled by the same symbol) and evolutionary trees (i.e., leaf-labeled
trees with distinct symbols for distinct leaves). This paper presents an
algorithm for comparing trees that are labeled in an arbitrary manner. In
addition to this generality, this algorithm is faster than the previous
algorithms.
Another contribution of this paper is on maximum weight bipartite matchings.
We show how to speed up the best known matching algorithms when the input
graphs are node-unbalanced or weight-unbalanced. Based on these enhancements,
we obtain an efficient algorithm for a new matching problem called the
hierarchical bipartite matching problem, which is at the core of our maximum
agreement subtree algorithm.Comment: To appear in Journal of Algorithm
The generalized Robinson-Foulds metric
The Robinson-Foulds (RF) metric is arguably the most widely used measure of
phylogenetic tree similarity, despite its well-known shortcomings: For example,
moving a single taxon in a tree can result in a tree that has maximum distance
to the original one; but the two trees are identical if we remove the single
taxon. To this end, we propose a natural extension of the RF metric that does
not simply count identical clades but instead, also takes similar clades into
consideration. In contrast to previous approaches, our model requires the
matching between clades to respect the structure of the two trees, a property
that the classical RF metric exhibits, too. We show that computing this
generalized RF metric is, unfortunately, NP-hard. We then present a simple
Integer Linear Program for its computation, and evaluate it by an
all-against-all comparison of 100 trees from a benchmark data set. We find that
matchings that respect the tree structure differ significantly from those that
do not, underlining the importance of this natural condition.Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI2013
Cutset Sampling for Bayesian Networks
The paper presents a new sampling methodology for Bayesian networks that
samples only a subset of variables and applies exact inference to the rest.
Cutset sampling is a network structure-exploiting application of the
Rao-Blackwellisation principle to sampling in Bayesian networks. It improves
convergence by exploiting memory-based inference algorithms. It can also be
viewed as an anytime approximation of the exact cutset-conditioning algorithm
developed by Pearl. Cutset sampling can be implemented efficiently when the
sampled variables constitute a loop-cutset of the Bayesian network and, more
generally, when the induced width of the networks graph conditioned on the
observed sampled variables is bounded by a constant w. We demonstrate
empirically the benefit of this scheme on a range of benchmarks
Machine Learning at Microsoft with ML .NET
Machine Learning is transitioning from an art and science into a technology
available to every developer. In the near future, every application on every
platform will incorporate trained models to encode data-based decisions that
would be impossible for developers to author. This presents a significant
engineering challenge, since currently data science and modeling are largely
decoupled from standard software development processes. This separation makes
incorporating machine learning capabilities inside applications unnecessarily
costly and difficult, and furthermore discourage developers from embracing ML
in first place. In this paper we present ML .NET, a framework developed at
Microsoft over the last decade in response to the challenge of making it easy
to ship machine learning models in large software applications. We present its
architecture, and illuminate the application demands that shaped it.
Specifically, we introduce DataView, the core data abstraction of ML .NET which
allows it to capture full predictive pipelines efficiently and consistently
across training and inference lifecycles. We close the paper with a
surprisingly favorable performance study of ML .NET compared to more recent
entrants, and a discussion of some lessons learned
CSNL: A cost-sensitive non-linear decision tree algorithm
This article presents a new decision tree learning algorithm called CSNL that induces Cost-Sensitive Non-Linear decision trees. The algorithm is based on the hypothesis that nonlinear decision nodes provide a better basis than axis-parallel decision nodes and utilizes discriminant analysis to construct nonlinear decision trees that take account of costs of misclassification.
The performance of the algorithm is evaluated by applying it to seventeen datasets and the results are compared with those obtained by two well known cost-sensitive algorithms, ICET and MetaCost, which generate multiple trees to obtain some of the best results to date. The results show that CSNL performs at least as well, if not better than these algorithms, in more than twelve of the datasets and is considerably faster. The use of bagging with CSNL further enhances its performance showing the significant benefits of using nonlinear decision nodes.
The performance of the algorithm is evaluated by applying it to seventeen data sets and the results are
compared with those obtained by two well known cost-sensitive algorithms, ICET and MetaCost, which generate multiple trees to obtain some of the best results to date.
The results show that CSNL performs at least as well, if not better than these algorithms, in more than twelve of the data sets and is considerably faster.
The use of bagging with CSNL further enhances its performance showing the significant benefits of using non-linear decision nodes
Faster Algorithms for the Maximum Common Subtree Isomorphism Problem
The maximum common subtree isomorphism problem asks for the largest possible
isomorphism between subtrees of two given input trees. This problem is a
natural restriction of the maximum common subgraph problem, which is -hard in general graphs. Confining to trees renders polynomial time
algorithms possible and is of fundamental importance for approaches on more
general graph classes. Various variants of this problem in trees have been
intensively studied. We consider the general case, where trees are neither
rooted nor ordered and the isomorphism is maximum w.r.t. a weight function on
the mapped vertices and edges. For trees of order and maximum degree
our algorithm achieves a running time of by
exploiting the structure of the matching instances arising as subproblems. Thus
our algorithm outperforms the best previously known approaches. No faster
algorithm is possible for trees of bounded degree and for trees of unbounded
degree we show that a further reduction of the running time would directly
improve the best known approach to the assignment problem. Combining a
polynomial-delay algorithm for the enumeration of all maximum common subtree
isomorphisms with central ideas of our new algorithm leads to an improvement of
its running time from to ,
where is the order of the larger tree, is the number of different
solutions, and is the minimum of the maximum degrees of the input
trees. Our theoretical results are supplemented by an experimental evaluation
on synthetic and real-world instances
Sparse Learning over Infinite Subgraph Features
We present a supervised-learning algorithm from graph data (a set of graphs)
for arbitrary twice-differentiable loss functions and sparse linear models over
all possible subgraph features. To date, it has been shown that under all
possible subgraph features, several types of sparse learning, such as Adaboost,
LPBoost, LARS/LASSO, and sparse PLS regression, can be performed. Particularly
emphasis is placed on simultaneous learning of relevant features from an
infinite set of candidates. We first generalize techniques used in all these
preceding studies to derive an unifying bounding technique for arbitrary
separable functions. We then carefully use this bounding to make block
coordinate gradient descent feasible over infinite subgraph features, resulting
in a fast converging algorithm that can solve a wider class of sparse learning
problems over graph data. We also empirically study the differences from the
existing approaches in convergence property, selected subgraph features, and
search-space sizes. We further discuss several unnoticed issues in sparse
learning over all possible subgraph features.Comment: 42 pages, 24 figures, 4 table
Localization in Unstructured Environments: Towards Autonomous Robots in Forests with Delaunay Triangulation
Autonomous harvesting and transportation is a long-term goal of the forest
industry. One of the main challenges is the accurate localization of both
vehicles and trees in a forest. Forests are unstructured environments where it
is difficult to find a group of significant landmarks for current fast
feature-based place recognition algorithms. This paper proposes a novel
approach where local observations are matched to a general tree map using the
Delaunay triangularization as the representation format. Instead of point cloud
based matching methods, we utilize a topology-based method. First, tree trunk
positions are registered at a prior run done by a forest harvester. Second, the
resulting map is Delaunay triangularized. Third, a local submap of the
autonomous robot is registered, triangularized and matched using triangular
similarity maximization to estimate the position of the robot. We test our
method on a dataset accumulated from a forestry site at Lieksa, Finland. A
total length of 2100\,m of harvester path was recorded by an industrial
harvester with a 3D laser scanner and a geolocation unit fixed to the frame.
Our experiments show a 12\,cm s.t.d. in the location accuracy and with
real-time data processing for speeds not exceeding 0.5\,m/s. The accuracy and
speed limit is realistic during forest operations
- …