
    Average-case analysis of perfect sorting by reversals (Journal Version)

    Perfect sorting by reversals, a problem originating in computational genomics, is the process of sorting a signed permutation to either the identity or the reversed identity permutation by a sequence of reversals that do not break any common interval. BĂ©rard et al. (2007) make use of strong interval trees to describe an algorithm for sorting signed permutations by reversals. Combinatorial properties of this family of trees are essential to the analysis of the algorithm. Here, we use the expected values of certain tree parameters to prove that the average run-time of the algorithm is at worst polynomial, and additionally that, for sufficiently long permutations, the sorting algorithm runs in polynomial time with probability one. Furthermore, our analysis of the subclass of commuting scenarios yields precise results on the average length of a reversal and the average number of reversals. A preliminary version of this work appeared in the proceedings of Combinatorial Pattern Matching (CPM) 2009; see arXiv:0901.2847. Published in Discrete Mathematics, Algorithms and Applications, vol. 3(3), 201
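
    As a small illustration of the operation being analyzed: a reversal on a signed permutation reverses a contiguous segment and flips the sign of every element in it. A minimal sketch in Python follows (the representation and function name are illustrative, not taken from the paper):

    def reverse(perm, i, j):
        """Apply a reversal to the segment perm[i..j] (inclusive): reverse
        its order and flip the sign of every element, as reversals act on
        signed permutations."""
        segment = [-x for x in reversed(perm[i:j + 1])]
        return perm[:i] + segment + perm[j + 1:]

    # Example: one reversal on (+3, -1, +2) acting on positions 1..2.
    print(reverse([3, -1, 2], 1, 2))  # -> [3, -2, 1]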

    Supertree construction by matrix representation with flip


    A Simple Characterization of the Minimal Obstruction Sets for Three-State Perfect Phylogenies

    Lam, Gusfield, and Sridhar (2009) showed that a set of three-state characters has a perfect phylogeny if and only if every subset of three characters has a perfect phylogeny. They also gave a complete characterization of the sets of three three-state characters that do not have a perfect phylogeny. However, it is not clear from their characterization how to find a subset of three characters that does not have a perfect phylogeny without testing all triples of characters. In this note, we build upon their result by giving a simple characterization, checkable by testing all pairs of characters, of when a set of three-state characters does not have a perfect phylogeny.
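
    To make the complexity gain concrete: testing all triples of k characters requires O(k^3) perfect-phylogeny tests, whereas a characterization checkable on pairs needs only O(k^2). A hedged sketch of the two search strategies, where has_perfect_phylogeny and pair_obstruction are hypothetical black-box predicates standing in for the respective tests:

    from itertools import combinations

    def find_bad_triple(characters, has_perfect_phylogeny):
        """Exhaustive O(k^3) search for a triple with no perfect phylogeny."""
        for triple in combinations(characters, 3):
            if not has_perfect_phylogeny(triple):
                return triple
        return None

    def find_bad_pair(characters, pair_obstruction):
        """O(k^2) search using a pairwise obstruction test of the kind the
        characterization in this note provides."""
        for pair in combinations(characters, 2):
            if pair_obstruction(pair):
                return pair
        return None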

    Efficient algorithms in analyzing genomic data

    With the development of high-throughput, low-cost genotyping technologies, immense amounts of data can be cheaply and efficiently produced for various genetic studies. A typical dataset may contain hundreds of samples with millions of genotypes/haplotypes. To keep data analysis from becoming a bottleneck, there is an evident need for fast and efficient analysis methods. My thesis focuses on two important problems in genetic data analysis.

    Genome-wide association mapping. The goal of genome-wide association mapping is to identify genes or narrow regions in the genome that have significant statistical correlations to the given phenotypes. The discovery of such genes offers the potential for increased understanding of biological processes affecting phenotypes such as body weight and blood pressure. I developed two phylogeny-based algorithms for this problem. TreeQA uses local perfect phylogeny trees in genome-wide genotype/phenotype association mapping: samples are partitioned according to the subtrees they belong to, and the association between a tree and the phenotype is measured by statistical tests. TreeQA+ inherits all the advantages of TreeQA and improves on it by incorporating sample correlations into the association study.

    Sample selection for maximal genetic diversity. Given a large set of samples, it is usually more efficient to first conduct experiments on a small subset, which raises the question of what subset to use. In many experimental scenarios the ultimate objective is to maintain, or at least maximize, the genetic diversity within relatively small breeding populations. For biallelic SNP data, samples are selected based on their genetic diversity over a set of SNPs: given a set of samples, my algorithms search for the minimum subset that retains all of the diversity (or a high percentage of it). For more general, non-biallelic data, information-theoretic measures such as entropy and mutual information quantify the diversity of a sample subset, and samples are selected to maximize the information retained.
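
    As one concrete reading of the entropy-based selection idea, a greedy selector can repeatedly add the sample that most increases the summed per-locus Shannon entropy of the chosen subset. This is a sketch only; the thesis' actual objectives and algorithms may differ:

    import math
    from collections import Counter

    def site_entropy(column):
        """Shannon entropy (bits) of one locus across the chosen samples."""
        counts = Counter(column)
        n = len(column)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def total_entropy(samples, chosen):
        """Summed per-locus entropy of the subset of samples in 'chosen'."""
        columns = zip(*(samples[i] for i in chosen))
        return sum(site_entropy(col) for col in columns)

    def greedy_select(samples, k):
        """Pick k sample indices, each step maximizing retained diversity."""
        chosen = []
        remaining = set(range(len(samples)))
        while len(chosen) < k and remaining:
            best = max(remaining, key=lambda i: total_entropy(samples, chosen + [i]))
            chosen.append(best)
            remaining.remove(best)
        return chosen

    # Example with non-biallelic genotype strings (one character per locus):
    samples = ["AACG", "AATG", "CACG", "CATT"]
    print(greedy_select(samples, 2))  # picks a maximally diverse pair, e.g. [0, 3]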

    Finding Optimal Triangulations Parameterized by Edge Clique Cover


    Fast and accurate supertrees: towards large scale phylogenies

    Phylogenetics is the study of evolutionary relationships between biological entities; phylogenetic trees (phylogenies) are a visualization of these evolutionary relationships. Accurate approaches to reconstruct phylogenies from sequence data usually result in NP-hard optimization problems, hence local search heuristics have to be applied in practice. These methods are highly accurate and fast enough as long as the input data are not too large. Divide-and-conquer techniques are a promising approach to boost the scalability and accuracy of such local search heuristics on very large datasets. A divide-and-conquer method breaks a large phylogenetic problem down into smaller sub-problems that are computationally easier to solve. The sub-problems (overlapping trees) are then combined using a supertree method.

    Supertree methods merge a set of overlapping phylogenetic trees into a supertree containing all taxa of the input trees. The challenge in supertree reconstruction is how to deal with conflicting information in the input trees. Many algorithms with different objective functions have been suggested to resolve these conflicts. In particular, there are methods that encode the source trees in a matrix and construct the supertree by applying a local search heuristic that optimizes the respective objective function. The most widely used supertree methods rely on such local search heuristics. However, to really improve the scalability of accurate tree reconstruction through divide-and-conquer approaches, accurate polynomial-time methods are needed for the supertree reconstruction step.

    In this work, we present approaches for accurate polynomial-time supertree reconstruction, in particular Bad Clade Deletion (BCD), a novel heuristic supertree algorithm with polynomial running time. BCD uses minimum cuts to greedily delete a locally minimal number of columns from a matrix representation to make it compatible. Unlike local search heuristics, BCD is guaranteed to return the directed perfect phylogeny for the input matrix, corresponding to the parent tree of the input trees, if one exists. BCD can take support values of the source trees into account without an increase in complexity. We show how reliable clades can be used to restrict the search space for BCD and how those clades can be collected from the input data using the Greedy Strict Consensus Merger. Finally, we introduce a beam search extension for the BCD algorithm that keeps a constant number of partial solutions alive in each top-down iteration phase; the guaranteed worst-case running time of BCD with the beam search extension is still polynomial. We present an exact and a randomized subroutine to generate the suboptimal partial solutions.

    In a thorough evaluation on several simulated and biological datasets against a representative set of supertree methods, we found that, when using support values and search space restriction on simulated data, BCD is more accurate than the most accurate supertree methods, while simultaneously being faster than any other evaluated method. The beam search extension improved the accuracy of BCD on all evaluated datasets, at the cost of speed. We found that BCD supertrees can boost maximum likelihood tree reconstruction when used as starting trees, and that BCD could handle large datasets on which local search heuristics did not converge in reasonable time. Due to its combination of speed, accuracy, and the ability to reconstruct the parent tree if one exists, BCD is a promising approach to enable outstanding scalability of divide-and-conquer approaches.
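
    For context on the matrix that BCD makes compatible: in the standard matrix representation, every clade of every input tree becomes one column, with 1 for taxa inside the clade, 0 for the tree's remaining taxa, and '?' for taxa absent from that source tree. A minimal sketch with illustrative data structures:

    def matrix_representation(trees, all_taxa):
        """trees: list of (taxa_of_tree, clades) pairs, where clades is a
        list of taxon sets. Returns one row (string) per taxon."""
        columns = []
        for taxa, clades in trees:
            for clade in clades:
                col = {}
                for t in all_taxa:
                    if t not in taxa:
                        col[t] = "?"  # taxon missing from this source tree
                    else:
                        col[t] = "1" if t in clade else "0"
                columns.append(col)
        return {t: "".join(col[t] for col in columns) for t in all_taxa}

    # Two overlapping source trees: ((A,B),C) and ((B,C),D).
    trees = [({"A", "B", "C"}, [{"A", "B"}]),
             ({"B", "C", "D"}, [{"B", "C"}])]
    for taxon, row in sorted(matrix_representation(trees, "ABCD").items()):
        print(taxon, row)  # A 1?, B 11, C 01, D ?0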

    GiRaF: robust, computational identification of influenza reassortments via graph mining

    Reassortments in the influenza virus—a process where strains exchange genetic segments—have been implicated in two out of three pandemics of the 20th century as well as the 2009 H1N1 outbreak. While advances in sequencing have led to an explosion in the number of available whole-genome sequences, an understanding of the rate and distribution of reassortments and their role in viral evolution is still lacking. An important factor in this is the paucity of automated tools for confident identification of reassortments from sequence data, owing to the challenges of analyzing large, uncertain viral phylogenies. We describe here a novel computational method, called GiRaF (Graph-incompatibility-based Reassortment Finder), that robustly identifies reassortments in a fully automated fashion while accounting for uncertainties in the inferred phylogenies. The algorithms behind GiRaF search large collections of Markov chain Monte Carlo (MCMC)-sampled trees for groups of incompatible splits, using a fast biclique enumeration algorithm coupled with several statistical tests to identify sets of taxa with differential phylogenetic placement. GiRaF correctly finds known reassortments in human, avian, and swine influenza populations, including the evolutionary events that led to the recent ‘swine flu’ outbreak. GiRaF also identifies several previously unreported reassortments via whole-genome studies cataloging events in H5N1 and swine influenza isolates.
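
    The incompatibility notion GiRaF searches for can be stated compactly: two splits (bipartitions) of the same taxon set are compatible exactly when at least one of the four pairwise intersections of their sides is empty. A minimal sketch (the function name is illustrative):

    def incompatible(split1, split2, taxa):
        """True iff the bipartitions split1|rest and split2|rest cannot both
        be edges of one tree on 'taxa', i.e. all four intersections of their
        sides are non-empty."""
        a, b = split1, taxa - split1
        c, d = split2, taxa - split2
        return all(s & t for s, t in ((a, c), (a, d), (b, c), (b, d)))

    taxa = {"s1", "s2", "s3", "s4"}
    print(incompatible({"s1", "s2"}, {"s1", "s3"}, taxa))  # True
    print(incompatible({"s1", "s2"}, {"s1", "s2"}, taxa))  # False: same split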

    Improved Lower Bounds on the Compatibility of Multi-State Characters

    We study a long-standing conjecture on the necessary and sufficient conditions for the compatibility of multi-state characters: there exists a function f(r) such that, for any set C of r-state characters, C is compatible if and only if every subset of f(r) characters of C is compatible. We show that for every r ≄ 2, there exists an incompatible set C of ⌊r/2⌋ · ⌈r/2⌉ + 1 r-state characters such that every proper subset of C is compatible. Thus, f(r) ≄ ⌊r/2⌋ · ⌈r/2⌉ + 1 for every r ≄ 2. This improves the previous lower bound of f(r) ≄ r given by Meacham (1983), and generalizes the construction showing that f(4) ≄ 5 given by Habib and To (2011). We prove our result via a result on quartet compatibility that may be of independent interest: for every integer n ≄ 4, there exists an incompatible set Q of ⌊(n−2)/2⌋ · ⌈(n−2)/2⌉ + 1 quartets over n labels such that every proper subset of Q is compatible. We contrast this with a result on the compatibility of triplets: for every n ≄ 3, if R is an incompatible set of more than n − 1 triplets over n labels, then some proper subset of R is incompatible. We show this upper bound is tight by exhibiting, for every n ≄ 3, an incompatible set R of n − 1 triplets over n taxa such that every proper subset of R is compatible.
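
    For concreteness, the new lower bound ⌊r/2⌋ · ⌈r/2⌉ + 1 can be tabulated for small r; it matches f(4) ≄ 5 and overtakes Meacham's bound f(r) ≄ r from r = 4 onward:

    def lower_bound(r):
        """⌊r/2⌋ · ⌈r/2⌉ + 1, the lower bound on f(r) from the abstract."""
        return (r // 2) * ((r + 1) // 2) + 1

    for r in range(2, 7):
        print(r, lower_bound(r))  # 2->2, 3->3, 4->5, 5->7, 6->10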

    Towards characterizing the solution space of the 1-Dollo Phylogeny problem

    Cancer cells may mutate multiple times, from a normal state to a mutated state and vice versa. Given sequenced data, we can model the mutation process with a phylogenetic tree. One representative model is k-Dollo parsimony, in which all observed mutations arise from a single normal cell and each character of a cell is gained at most once and lost at most k times. We examine the 1-Dollo Phylogeny problem: does a 1-Dollo phylogeny, a tree that follows the 1-Dollo parsimony model, exist for a given set of observations? Current algorithms for the 1-Dollo Phylogeny problem only report whether or not a set of observations has a 1-Dollo phylogeny, outputting a single solution. We explore the structure of 1-Dollo phylogenies and use our notion of a skeleton to develop an algorithm that enumerates all 1-Dollo phylogenies for any set of observations. This algorithm runs much faster than naive brute-force enumeration on random input. The implementation is available at https://github.com/sxie12/skeleton_solver.
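
    As a minimal sketch of the 1-Dollo condition itself (the tree encoding is illustrative, and this is a verifier, not the paper's enumeration algorithm): on a rooted tree whose every node is labeled 0 or 1 for one character, the labeling is 1-Dollo if the root is 0, the state switches 0 -> 1 (gain) on at most one edge, and 1 -> 0 (loss) on at most one edge:

    def is_one_dollo(parent, state, root):
        """parent maps each non-root node to its parent; state maps each
        node to 0 or 1 for a single character."""
        gains = losses = 0
        for child, par in parent.items():
            if state[par] == 0 and state[child] == 1:
                gains += 1
            elif state[par] == 1 and state[child] == 0:
                losses += 1
        return state[root] == 0 and gains <= 1 and losses <= 1

    # Root r with children u, v; u has children x, y.
    parent = {"u": "r", "v": "r", "x": "u", "y": "u"}
    state = {"r": 0, "u": 1, "v": 0, "x": 1, "y": 0}  # one gain (r->u), one loss (u->y)
    print(is_one_dollo(parent, state, "r"))  # True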

    10231 Abstracts Collection -- Structure Discovery in Biology: Motifs, Networks & Phylogenies

    From 06.06. to 11.06.2010, the Dagstuhl Seminar 10231 "Structure Discovery in Biology: Motifs, Networks & Phylogenies" was held in Schloss Dagstuhl – Leibniz Center for Informatics. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided where available.
    • 

    corecore