An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings
A widely used method for determining the similarity of two labeled trees is
to compute a maximum agreement subtree of the two trees. Previous work on this
similarity measure is only concerned with the comparison of labeled trees of
two special kinds, namely, uniformly labeled trees (i.e., trees with all their
nodes labeled by the same symbol) and evolutionary trees (i.e., leaf-labeled
trees with distinct symbols for distinct leaves). This paper presents an
algorithm for comparing trees that are labeled in an arbitrary manner. In
addition to this generality, this algorithm is faster than the previous
algorithms.
Another contribution of this paper is on maximum weight bipartite matchings.
We show how to speed up the best known matching algorithms when the input
graphs are node-unbalanced or weight-unbalanced. Based on these enhancements,
we obtain an efficient algorithm for a new matching problem called the
hierarchical bipartite matching problem, which is at the core of our maximum
agreement subtree algorithm. Comment: To appear in Journal of Algorithms
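At the core of the algorithm above is maximum weight bipartite matching on unbalanced instances. As a minimal, hypothetical illustration of the matching problem itself (not of the paper's speed-ups), the sketch below solves it by brute force over a left-by-right weight matrix, assuming non-negative weights and that the left side is the smaller, "unbalanced" side:

```python
from itertools import permutations

def max_weight_matching_brute_force(weights):
    """Maximum weight matching in a bipartite graph given as a
    left-by-right weight matrix (missing edges as weight 0).
    Brute force over injections; for illustration only."""
    n_left = len(weights)
    n_right = len(weights[0])
    # Node-unbalanced case: match every left node to a distinct right
    # node (assumes n_left <= n_right and non-negative weights).
    best = 0
    for perm in permutations(range(n_right), n_left):
        best = max(best, sum(weights[i][j] for i, j in enumerate(perm)))
    return best
```

The node-unbalanced case is visible here: with a single left node and many right nodes, only the best incident edge matters, which is the kind of asymmetry the paper's enhancements exploit.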
Analyzing the Flow of Information from Initial Publishing to Wikipedia
This thesis covers my research into the factors that lead to a research paper being cited by Wikipedia. Wikipedia is one of the most popular websites on the internet for quickly learning about a specific topic, and it achieved this by backing up its claims with cited sources, many of which are research papers. I wanted to see exactly how those papers were found by Wikipedia's editors when they wrote the articles. To do this, I gathered thousands of computer science research papers from arXiv.org, as well as a selection of papers that were cited by Wikipedia, so that I could examine those papers and see what made them visible and attractive to Wikipedia editors.
After gathering the information on how and when these papers were cited, I ran a series of tests to learn as much as I could about what causes a paper to be cited by Wikipedia. I discovered that papers cited by Wikipedia tend to be more popular than uncited papers even before they are cited, but that getting cited by Wikipedia can itself result in a boost in popularity. Wikipedia editors also tend to choose papers that either showcase a creation of the author(s) or give a general overview of a topic. I also discovered one paper that was likely added to Wikipedia by its author in an attempt at increased visibility.
The generalized Robinson-Foulds metric
The Robinson-Foulds (RF) metric is arguably the most widely used measure of
phylogenetic tree similarity, despite its well-known shortcomings: For example,
moving a single taxon in a tree can result in a tree that has maximum distance
to the original one; but the two trees are identical if we remove the single
taxon. To this end, we propose a natural extension of the RF metric that does
not simply count identical clades but instead, also takes similar clades into
consideration. In contrast to previous approaches, our model requires the
matching between clades to respect the structure of the two trees, a property
that the classical RF metric exhibits, too. We show that computing this
generalized RF metric is, unfortunately, NP-hard. We then present a simple
Integer Linear Program for its computation, and evaluate it by an
all-against-all comparison of 100 trees from a benchmark data set. We find that
matchings that respect the tree structure differ significantly from those that
do not, underlining the importance of this natural condition.Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI2013
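For concreteness, the classical RF distance counts clades present in exactly one of the two trees. A minimal sketch for rooted trees encoded as nested tuples of string leaves (an assumed toy encoding; the paper's generalized metric, which matches similar clades under a structure-respecting matching, is NP-hard and not shown here):

```python
def clades(tree):
    """Collect the leaf set of every internal node of a rooted tree
    given as nested tuples with string leaves."""
    found = set()
    def leaves(node):
        if isinstance(node, str):        # a leaf
            return frozenset([node])
        ls = frozenset().union(*(leaves(child) for child in node))
        found.add(ls)                    # record this internal node's clade
        return ls
    leaves(tree)
    return found

def rf_distance(t1, t2):
    """Classical Robinson-Foulds distance for rooted trees: the number
    of clades present in exactly one of the two trees."""
    return len(clades(t1) ^ clades(t2))
```

On ((a,b),(c,d)) versus ((a,c),(b,d)) every non-root clade differs, illustrating how coarse the all-or-nothing clade comparison can be.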
Fixed Parameter Polynomial Time Algorithms for Maximum Agreement and Compatible Supertrees
Consider a set of labels $L$ and a set of trees $\mathcal{T} = \{\mathcal{T}^{(1)}, \mathcal{T}^{(2)}, \ldots, \mathcal{T}^{(k)}\}$, where each tree $\mathcal{T}^{(i)}$ is leaf-labeled by some subset of $L$. Computing a maximum agreement supertree or a maximum compatible supertree of $\mathcal{T}$ is hard in general when $k \geq 3$; this paper presents fixed parameter polynomial time algorithms for both problems when the number of trees $k$ and the maximum degree $D$ of the trees are constant.
Faster Algorithms for the Maximum Common Subtree Isomorphism Problem
The maximum common subtree isomorphism problem asks for the largest possible
isomorphism between subtrees of two given input trees. This problem is a
natural restriction of the maximum common subgraph problem, which is NP-hard in general graphs. Confining to trees renders polynomial time
algorithms possible and is of fundamental importance for approaches on more
general graph classes. Various variants of this problem in trees have been
intensively studied. We consider the general case, where trees are neither
rooted nor ordered and the isomorphism is maximum w.r.t. a weight function on
the mapped vertices and edges. For trees of order $n$ and maximum degree
$\Delta$ our algorithm achieves a running time of $\mathcal{O}(n^2 \Delta)$ by
exploiting the structure of the matching instances arising as subproblems. Thus
our algorithm outperforms the best previously known approaches. No faster
algorithm is possible for trees of bounded degree, and for trees of unbounded
degree we show that a further reduction of the running time would directly
improve the best known approach to the assignment problem. Combining a
polynomial-delay algorithm for the enumeration of all maximum common subtree
isomorphisms with central ideas of our new algorithm leads to an improvement of
its running time from $\mathcal{O}(n^6 + Tn^2)$ to $\mathcal{O}(n^3 + Tn\Delta)$,
where $n$ is the order of the larger tree, $T$ is the number of different
solutions, and $\Delta$ is the minimum of the maximum degrees of the input
trees. Our theoretical results are supplemented by an experimental evaluation
on synthetic and real-world instances.
Enumerating All Maximal Frequent Subtrees
Given a collection of leaf-labeled trees on a common leafset and a fraction $f \in (1/2, 1]$, a frequent subtree (FST) is a subtree isomorphically included in at least a fraction $f$ of the input trees. The well-known maximum agreement subtree (MAST) problem identifies the FST with $f = 1$ having the largest number of leaves. Apart from its intrinsic interest from the algorithmic perspective, MAST has practical applications as a metric for tree similarity, for computing tree congruence, in detecting horizontal gene transfer events, and as a consensus approach. Enumerating FSTs extends the MAST problem by definition and reveals additional subtrees not displayed by MAST. This can happen in two ways: such a subtree is included in a majority but not all of the input trees, or such a subtree, though included in all the input trees, does not have the maximum number of leaves. Further, FSTs can be enumerated on collections of trees having partially overlapping leafsets; MAST may not be useful here, especially if the common overlap among the leafsets is very low. Though very useful, FSTs suffer from combinatorial explosion in number, which motivates enumerating just the maximal frequent subtrees (MFSTs). An MFST is an FST that is not a subtree of any other FST; the set of MFSTs is a compact, non-redundant summary of all FSTs and is much smaller in size. Here we tackle the novel problem of enumerating all MFSTs in collections of phylogenetic trees. We demonstrate its utility in returning larger consensus trees in comparison to MAST. The current implementation is available on the web.
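As a rough, simplified illustration of the frequency threshold $f$: the sketch below counts frequent clades (leaf sets of internal nodes) rather than full frequent subtrees, so it is only a clade-level analogue of the FST notion above, with trees encoded as nested tuples of string leaves (an assumed toy encoding):

```python
from collections import Counter

def clades(tree):
    """Leaf sets of all internal nodes of a rooted tree given as
    nested tuples with string leaves."""
    found = set()
    def leaves(node):
        if isinstance(node, str):        # a leaf
            return frozenset([node])
        ls = frozenset().union(*(leaves(child) for child in node))
        found.add(ls)
        return ls
    leaves(tree)
    return found

def frequent_clades(trees, f):
    """Clades occurring in at least a fraction f of the input trees --
    a clade-level simplification of frequent subtrees, not the full
    subtree-inclusion notion used in the abstract."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))
    return {c for c, k in counts.items() if k >= f * len(trees)}
```

With f = 1 this keeps only clades shared by every tree, mirroring how MAST is the f = 1 special case of FST enumeration.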
Faster Algorithms for Semi-Matching Problems
We consider the problem of finding a \textit{semi-matching} in bipartite graphs,
a problem also extensively studied under various names in the scheduling
literature. We give faster algorithms for both the weighted and unweighted cases.
For the weighted case, we give an $O(nm \log n)$-time algorithm, where $n$ is
the number of vertices and $m$ is the number of edges, by exploiting the
geometric structure of the problem. This improves the classical $O(n^3)$
algorithms by Horn [Operations Research 1973] and Bruno, Coffman and Sethi
[Communications of the ACM 1974].
For the unweighted case, the bound can be improved even further. We give a
simple divide-and-conquer algorithm which runs in $O(\sqrt{n}\,m \log n)$ time,
improving two previous $O(nm)$-time algorithms by Abraham [MSc thesis,
University of Glasgow 2003] and Harvey, Ladner, Lov\'asz and Tamir [WADS 2003
and Journal of Algorithms 2006]. We also extend this algorithm to solve the
\textit{Balanced Edge Cover} problem in $O(\sqrt{n}\,m \log n)$ time, improving the
previous $O(nm)$-time algorithm by Harada, Ono, Sadakane and Yamashita [ISAAC
2008]. Comment: ICALP 2010
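A semi-matching assigns every left vertex (task) to exactly one adjacent right vertex (machine); under unit jobs a machine serving d tasks contributes d*(d+1)/2 to the total waiting time. A brute-force sketch of this objective (illustrative only; the adjacency-list encoding of machine indices is an assumption, and the paper's algorithms are far faster):

```python
from itertools import product

def optimal_semi_matching_cost(neighbors, n_machines):
    """Brute-force optimal semi-matching cost: neighbors[i] lists the
    machines adjacent to task i; each task is assigned to exactly one
    adjacent machine, and a machine with load d costs d*(d+1)/2
    (total waiting time under unit jobs)."""
    best = float("inf")
    for assignment in product(*neighbors):   # one adjacent machine per task
        load = [0] * n_machines
        for machine in assignment:
            load[machine] += 1
        best = min(best, sum(d * (d + 1) // 2 for d in load))
    return best
```

For three tasks with neighbors [[0, 1], [0], [0, 1]] on two machines, the optimum splits the load 2/1 rather than piling all tasks on machine 0, which is the load-balancing flavor that distinguishes semi-matchings from ordinary matchings.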