17 research outputs found

    An Even Faster and More Unifying Algorithm for Comparing Trees via Unbalanced Bipartite Matchings

    A widely used method for determining the similarity of two labeled trees is to compute a maximum agreement subtree of the two trees. Previous work on this similarity measure is only concerned with the comparison of labeled trees of two special kinds, namely, uniformly labeled trees (i.e., trees with all their nodes labeled by the same symbol) and evolutionary trees (i.e., leaf-labeled trees with distinct symbols for distinct leaves). This paper presents an algorithm for comparing trees that are labeled in an arbitrary manner. In addition to this generality, this algorithm is faster than the previous algorithms. Another contribution of this paper is on maximum weight bipartite matchings. We show how to speed up the best known matching algorithms when the input graphs are node-unbalanced or weight-unbalanced. Based on these enhancements, we obtain an efficient algorithm for a new matching problem called the hierarchical bipartite matching problem, which is at the core of our maximum agreement subtree algorithm. Comment: To appear in Journal of Algorithms

    Analyzing the Flow of Information from Initial Publishing to Wikipedia

    This thesis covers my efforts at researching the factors that lead to a research paper being cited by Wikipedia. Wikipedia is one of the most popular websites on the internet for quickly learning about a specific topic. It achieved this by being able to back up its claims with cited sources, many of which are research papers. I wanted to see exactly how those papers were found by Wikipedia’s editors when they write the articles. To do this, I gathered thousands of computer science research papers from arXiv.org, as well as a selection of papers that were cited by Wikipedia, so that I could examine those papers and see what made them visible and attractive to the Wikipedia editors. After I gathered the information on how and when these papers are cited, I ran a series of tests on them to learn as much as I could about what causes a paper to be cited by Wikipedia. I discovered that papers that are cited by Wikipedia tend to be more popular than papers that are not, even before they are cited, but getting cited by Wikipedia can result in a boost in popularity. Wikipedia editors also tend to choose papers that either showcase a creation of the author(s) or give a general overview of a topic. I also discovered one paper that was likely added to Wikipedia by the author in an attempt at increased visibility.

    The generalized Robinson-Foulds metric

    The Robinson-Foulds (RF) metric is arguably the most widely used measure of phylogenetic tree similarity, despite its well-known shortcomings: for example, moving a single taxon in a tree can result in a tree that has maximum distance to the original one, even though the two trees are identical once that single taxon is removed. To this end, we propose a natural extension of the RF metric that does not simply count identical clades but also takes similar clades into consideration. In contrast to previous approaches, our model requires the matching between clades to respect the structure of the two trees, a property that the classical RF metric exhibits, too. We show that computing this generalized RF metric is, unfortunately, NP-hard. We then present a simple Integer Linear Program for its computation, and evaluate it by an all-against-all comparison of 100 trees from a benchmark data set. We find that matchings that respect the tree structure differ significantly from those that do not, underlining the importance of this natural condition. Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)
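
    The shortcoming of the classical RF metric described in this abstract can be illustrated with a minimal sketch. It is not the paper's generalized metric or its Integer Linear Program. It assumes rooted trees written as nested tuples of leaf names and counts non-trivial clades in the symmetric difference; the helper names `clades`, `rf_distance` and `prune`, and the two example trees, are hypothetical.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names.

    Assumption: a tree is a nested tuple whose leaves are strings.
    Single leaves and the full leaf set are treated as trivial and excluded.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    found.discard(leaves(tree))
    return {c for c in found if len(c) > 1}


def rf_distance(t1, t2):
    """Classical RF distance on rooted trees: clades present in exactly one tree."""
    return len(clades(t1) ^ clades(t2))


def prune(tree, taxon):
    """Remove one taxon and collapse any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [p for p in (prune(child, taxon) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


# Hypothetical example: taxon "a" moves from the innermost cherry to the root.
t1 = ((((("a", "b"), "c"), "d"), "e"), "f")
t2 = ((((("b", "c"), "d"), "e"), "f"), "a")

print(rf_distance(t1, t2))                          # no non-trivial clade is shared
print(rf_distance(prune(t1, "a"), prune(t2, "a")))  # identical after removing "a"
```

    In this example each tree has four non-trivial clades and none are shared, so the distance is as large as it can be for these two trees, while pruning the single moved taxon makes them identical.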

    Fixed Parameter Polynomial Time Algorithms for Maximum Agreement and Compatible Supertrees

    Consider a set of labels $L$ and a set of trees $\mathcal{T} = \{\mathcal{T}^{(1)}, \mathcal{T}^{(2)}, \ldots, \mathcal{T}^{(k)}\}$ where each tree $\mathcal{T}^{(i)}$ is distinctly leaf-labeled by some subset of $L$. One fundamental problem is to find the biggest tree (denoted as supertree) to represent $\mathcal{T}$ which minimizes the disagreements with the trees in $\mathcal{T}$ under certain criteria. This problem finds applications in phylogenetics, database, and data mining. In this paper, we focus on two particular supertree problems, namely, the maximum agreement supertree problem (MASP) and the maximum compatible supertree problem (MCSP). These two problems are known to be NP-hard for $k \geq 3$. This paper gives the first polynomial time algorithms for both MASP and MCSP when both $k$ and the maximum degree $D$ of the trees are constant.

    Faster Algorithms for the Maximum Common Subtree Isomorphism Problem

    The maximum common subtree isomorphism problem asks for the largest possible isomorphism between subtrees of two given input trees. This problem is a natural restriction of the maximum common subgraph problem, which is $\mathsf{NP}$-hard in general graphs. Confining to trees renders polynomial time algorithms possible and is of fundamental importance for approaches on more general graph classes. Various variants of this problem in trees have been intensively studied. We consider the general case, where trees are neither rooted nor ordered and the isomorphism is maximum w.r.t. a weight function on the mapped vertices and edges. For trees of order $n$ and maximum degree $\Delta$ our algorithm achieves a running time of $\mathcal{O}(n^2\Delta)$ by exploiting the structure of the matching instances arising as subproblems. Thus our algorithm outperforms the best previously known approaches. No faster algorithm is possible for trees of bounded degree and for trees of unbounded degree we show that a further reduction of the running time would directly improve the best known approach to the assignment problem. Combining a polynomial-delay algorithm for the enumeration of all maximum common subtree isomorphisms with central ideas of our new algorithm leads to an improvement of its running time from $\mathcal{O}(n^6+Tn^2)$ to $\mathcal{O}(n^3+Tn\Delta)$, where $n$ is the order of the larger tree, $T$ is the number of different solutions, and $\Delta$ is the minimum of the maximum degrees of the input trees. Our theoretical results are supplemented by an experimental evaluation on synthetic and real-world instances.

    Enumerating All Maximal Frequent Subtrees

    Given a collection of leaf-labeled trees on a common leafset and a fraction f in (1/2,1], a frequent subtree (FST) is a subtree isomorphically included in at least fraction f of the input trees. The well-known maximum agreement subtree (MAST) problem identifies the FST with f = 1 that has the largest number of leaves. Apart from its intrinsic interest from the algorithmic perspective, MAST has practical applications as a metric for tree similarity, for computing tree congruence, in detecting horizontal gene transfer events and as a consensus approach. Enumerating FSTs extends the MAST problem by definition and reveals additional subtrees not displayed by MAST. This can happen in two ways: such a subtree is included in a majority but not all of the input trees, or such a subtree, though included in all the input trees, does not have the maximum number of leaves. Further, FSTs can be enumerated on collections of trees having partially overlapping leafsets. MAST may not be useful here, especially if the common overlap among leafsets is very low. Though very useful, the number of FSTs suffers from combinatorial explosion - just a single enumeration of maximal frequent subtrees (MFSTs). An MFST is an FST that is not a subtree of any other FST. The set of MFSTs is a compact non-redundant summary of all FSTs and is much smaller in size. Here we tackle the novel problem of enumerating all MFSTs in collections of phylogenetic trees. We demonstrate its utility in returning larger consensus trees in comparison to MAST. The current implementation is available on the web.

    Faster Algorithms for Semi-Matching Problems

    We consider the problem of finding semi-matching in bipartite graphs which is also extensively studied under various names in the scheduling literature. We give faster algorithms for both weighted and unweighted case. For the weighted case, we give an $O(nm\log n)$-time algorithm, where $n$ is the number of vertices and $m$ is the number of edges, by exploiting the geometric structure of the problem. This improves the classical $O(n^3)$ algorithms by Horn [Operations Research 1973] and Bruno, Coffman and Sethi [Communications of the ACM 1974]. For the unweighted case, the bound could be improved even further. We give a simple divide-and-conquer algorithm which runs in $O(\sqrt{n}m\log n)$ time, improving two previous $O(nm)$-time algorithms by Abraham [MSc thesis, University of Glasgow 2003] and Harvey, Ladner, Lovász and Tamir [WADS 2003 and Journal of Algorithms 2006]. We also extend this algorithm to solve the Balance Edge Cover problem in $O(\sqrt{n}m\log n)$ time, improving the previous $O(nm)$-time algorithm by Harada, Ono, Sadakane and Yamashita [ISAAC 2008]. Comment: ICALP 201