202 research outputs found
How to compare arc-annotated sequences: The alignment hierarchy
We describe a new unifying framework to express comparison of arc-annotated sequences, which we call alignment of arc-annotated sequences. We first prove that this framework encompasses main existing models, which allows us to deduce complexity results for several cases from the literature. We also show that this framework gives rise to new relevant problems that have not been studied yet. We provide a thorough analysis of these novel cases by proposing two polynomial time algorithms and an NP-completeness proof. This leads to an almost exhaustive study of alignment of arc-annotated sequences.
Festparameter-Algorithmen fuer die Konsens-Analyse Genomischer Daten (Fixed-Parameter Algorithms for the Consensus Analysis of Genomic Data)
Fixed-parameter algorithms offer a constructive and powerful approach
to efficiently obtain solutions for NP-hard problems combining two
important goals: Fixed-parameter algorithms compute optimal solutions
within provable time bounds despite the (almost inevitable)
computational intractability of NP-hard problems. The essential idea
is to identify one or more aspects of the input to a problem as the
parameters, and to confine the combinatorial explosion of
computational difficulty to a function of the parameters such that the
costs are polynomial in the non-parameterized part of the input. This
makes sense especially for parameters which have small values in
applications. Fixed-parameter algorithms have become an established
algorithmic tool in a variety of application areas, among them
computational biology where small values for problem parameters are
often observed. A number of design techniques for fixed-parameter
algorithms have been proposed and bounded search trees are one of
them. In computational biology, however, examples of bounded search
tree algorithms have been, so far, rare.
This thesis investigates the use of bounded search tree algorithms for
consensus problems in the analysis of DNA and RNA data. More
precisely, we investigate consensus problems in the contexts of
sequence analysis, of quartet methods for phylogenetic reconstruction,
of gene order analysis, and of RNA secondary structure comparison. In
all cases, we present new efficient algorithms that incorporate the
bounded search tree paradigm in novel ways. On our way, we also obtain
results of parameterized hardness, showing that the respective
problems are unlikely to allow for a fixed-parameter algorithm, and we
introduce integer linear programs (ILPs) as a tool for classifying
problems as fixed-parameter tractable, i.e., as having fixed-parameter
algorithms. Most of our algorithms were implemented and tested on
practical data.
Fixed-parameter algorithms offer a constructive approach to solving
combinatorially hard, typically NP-hard, problems that addresses two
goals: optimal results are computed within provable running-time
bounds. The key idea is to treat one or more aspects of the problem
input as parameters of the problem and to confine the combinatorial
explosion of algorithmic difficulty to these parameters, so that the
running-time costs are polynomial in the non-parameterized part of
the input. If a fixed-parameter algorithm exists for a combinatorial
problem, the problem is called fixed-parameter tractable. Developing
fixed-parameter algorithms makes sense above all when the parameters
under consideration take only small values in applications.
Fixed-parameter algorithms have become a standard algorithmic tool
in many application areas, among them computational biology, where
small parameter values can be observed in many applications. The
known techniques for designing fixed-parameter algorithms include
bounded search trees. In computational biology there have so far
been only a few examples of the use of bounded search trees.
This thesis investigates the use of bounded search trees for NP-hard
consensus problems in the analysis of DNA and RNA data. We consider
consensus problems in the analysis of DNA sequence data, in the
analysis of so-called quartet data for constructing phylogenetic
hypotheses, in the analysis of gene order data, and in the
comparison of RNA structure data. In all cases we present new
efficient algorithms in which the bounded search tree paradigm is
realized in novel ways. Along the way we also obtain parameterized
hardness results showing that a fixed-parameter algorithm is
unlikely for the problems considered. In addition, we introduce
integer linear programming as a new technique for showing the
fixed-parameter tractability of a problem. The majority of the
algorithms presented here were implemented and tested on
application data.
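The bounded search tree idea described in this abstract can be illustrated on a textbook problem. The sketch below is not taken from the thesis: it uses Vertex Cover as a stand-in problem, and the function name `vertex_cover` and its edge-list input format are assumptions made for illustration. Each recursive call picks an uncovered edge and branches on its two endpoints, so the search tree has depth at most k and at most 2^k leaves, while the work per node is polynomial in the input size.

```python
def vertex_cover(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    edges is a list of 2-tuples (hypothetical input format).
    """
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is used up
    u, v = edges[0]
    # Any cover must contain u or v, so branch on both choices.
    for pick in (u, v):
        remaining = [e for e in edges if pick not in e]
        sub = vertex_cover(remaining, k - 1)
        if sub is not None:
            return sub | {pick}
    return None
```

The combinatorial explosion is confined to the parameter k: the number of recursive calls is bounded by a function of k alone, which is the property the abstract describes.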
Fixed-parameter algorithms for some combinatorial problems in bioinformatics
Fixed-parameter algorithmics was developed in the 1990s as an approach to solving NP-hard problems optimally within a guaranteed running time. It offers a new opportunity to solve NP-hard problems exactly even on large problem instances.
In this thesis, we apply fixed-parameter algorithms to cope with three NP-hard problems in bioinformatics:
The Flip Consensus Tree Problem is a combinatorial problem arising in computational phylogenetics. Using the formulation of the Flip Consensus Tree Problem as a graph-modification problem, we present a set of data reduction rules and two fixed-parameter algorithms with respect to the number of modifications. Additionally, we discuss several heuristic improvements to reduce the running time of our algorithms in practice. We also report computational results on phylogenetic data.
The Weighted Cluster Editing Problem is a graph-modification problem that arises in computational biology when clustering objects with respect to a given similarity or distance measure. We present one of our fixed-parameter algorithms with respect to the minimum modification cost and describe the idea of our fastest algorithm for this problem and its unweighted counterpart.
The Bond Order Assignment Problem asks for a bond order assignment of a molecule graph that minimizes a penalty function. We prove several complexity results on this problem and give two exact fixed-parameter algorithms for the problem. Our algorithms are based on dynamic programming over a tree decomposition of the molecule graph. They are fixed-parameter with respect to the treewidth of the molecule graph and the maximum atom valence. We implemented one of our algorithms with several heuristic improvements and evaluated it on a set of real molecule graphs. It turns out that our algorithm is very fast on this dataset and even outperforms a heuristic algorithm that is usually used in practice.
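To make the cluster editing formulation concrete, here is a minimal sketch of the classic branching strategy for the unweighted problem. It is not the algorithm from this thesis, which the abstract does not spell out. The names `find_conflict` and `cluster_editing` and the edge representation (a frozenset of two-element frozensets) are assumptions for illustration. A graph is a disjoint union of cliques exactly when it has no conflict triple, that is, vertices u, v, w with edges uv and vw but no edge uw. Any solution must change one of these three pairs, which gives a three-way branch and a search tree of size at most 3^k.

```python
from itertools import combinations


def find_conflict(vertices, edges):
    """Return (u, v, w) with uv and vw present but uw missing, or None."""
    for v in vertices:
        nbrs = [x for x in vertices if x != v and frozenset((x, v)) in edges]
        for u, w in combinations(nbrs, 2):
            if frozenset((u, w)) not in edges:
                return u, v, w
    return None


def cluster_editing(vertices, edges, k):
    """Return a list of at most k edits yielding a cluster graph, or None."""
    conflict = find_conflict(vertices, edges)
    if conflict is None:
        return []             # already a disjoint union of cliques
    if k == 0:
        return None           # a conflict remains but no budget is left
    u, v, w = conflict
    branches = [
        ("delete", u, v, edges - {frozenset((u, v))}),
        ("delete", v, w, edges - {frozenset((v, w))}),
        ("insert", u, w, edges | {frozenset((u, w))}),
    ]
    for op, a, b, new_edges in branches:
        rest = cluster_editing(vertices, new_edges, k - 1)
        if rest is not None:
            return [(op, a, b)] + rest
    return None
```

For a weighted variant, one would presumably charge each edit its pair weight against a cost budget instead of decrementing k by one; that extension is not shown here.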
Dynamic programming based RNA pseudoknot alignment
Pseudoknots are certain structural motifs of RNA molecules. In this thesis we consider the problem of RNA pseudoknot alignment. Most current approaches either discard pseudoknots in order to be efficient or rely on heuristics generating only approximate solutions. This work focuses on dynamic programming based alignment methods and proposes two new approaches for an exact solution of the alignment problem in the presence of pseudoknot structures. The first approach is able to handle arbitrary pseudoknots but, due to the NP-hardness of the problem, does not guarantee a polynomial runtime for all instances. Nevertheless, an analysis in terms of parameterized complexity shows that the algorithm is fixed-parameter tractable for a parameter that is small in practice. The second approach is a general scheme for the alignment of restricted classes of pseudoknots in polynomial time. It is motivated by existing RNA pseudoknot prediction algorithms. We show how to embed seven of those algorithms in a common scheme and present an analogous scheme for the alignment problem, which yields for each of the structure prediction algorithms a corresponding alignment algorithm. The alignment algorithms handle the same class of pseudoknots as the corresponding prediction algorithms, and the time and space complexity is only increased by a linear factor compared to the respective prediction algorithm. Both approaches have been implemented to evaluate their applicability in practice.
In this dissertation I address the alignment of certain RNA structures known as pseudoknots. Because this problem is NP-hard, most alignment methods available so far either ignore pseudoknots in order to be efficient or compute only approximate solutions with the help of heuristics.
In the present work I describe two new methods that use dynamic programming to compute an exact solution to the alignment problem for pseudoknot structures. The first method can align arbitrary pseudoknots and, since this is an NP-hard problem, does not in general have a polynomially bounded running time. A parameterized complexity analysis shows, however, that the algorithm is fixed-parameter tractable with respect to a parameter that is small in practice. The second method makes it possible to align different restricted classes of pseudoknots in polynomial time. As a first step, I show how existing prediction algorithms for seven such classes can be embedded in a common scheme. I then develop an analogous scheme for the alignment of pseudoknots that yields, for each of the prediction algorithms, a corresponding alignment algorithm with only linearly increased space and time complexity. Both methods were also implemented in order to evaluate their practical suitability.
Inferring Genomic Sequences
Recent advances in next generation sequencing have provided unprecedented opportunities for high-throughput genomic research, inexpensively producing millions of genomic sequences in a single run. Analysis of massive volumes of data results in a more accurate picture of the genome complexity and requires adequate bioinformatics support. We explore computational challenges of applying next generation sequencing to particular applications, focusing on the problem of reconstructing the viral quasispecies spectrum from pyrosequencing shotgun reads and the problem of inferring informative single nucleotide polymorphisms (SNPs) that statistically cover the genetic variation of a genome region in genome-wide association studies.
The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software cannot be used to simultaneously assemble and estimate the abundance of multiple closely related (but non-identical) quasispecies sequences. Here, we introduce a new Viral Spectrum Assembler (ViSpA) for inferring the quasispecies spectrum and compare it with the state-of-the-art ShoRAH tool on both synthetic and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. While ShoRAH has an advanced error correction algorithm, ViSpA is better at quasispecies assembly, producing a more accurate reconstruction of a viral population. We also foresee applying ViSpA to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.
Due to the large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs (tags) that covers the genetic variation of the entire set. We explore the trade-off between the number of tags used per non-tagged SNP and possible overfitting and propose an efficient 2LR-Tagging heuristic.
Computational haplotyping : theory and practice
Genomics has paved a new way to comprehend life and its evolution, and also to investigate causes of diseases and their treatment. One of the important problems in genomic analyses is haplotype assembly. Constructing complete and accurate haplotypes plays an essential role in understanding population genetics and how species evolve. In this thesis, we focus on computational approaches to haplotype assembly from third generation sequencing technologies. This involves huge amounts of sequencing data, and such data contain errors due to the single molecule sequencing protocols employed. Taking advantage of combinatorial formulations helps to correct for these errors to solve the haplotyping problem. Various computational techniques such as dynamic programming, parameterized algorithms, and graph algorithms are used to solve this problem. This thesis presents several contributions concerning the area of haplotyping. First, a novel algorithm based on dynamic programming is proposed to provide approximation guarantees for phasing a single individual. Second, an integrative approach is introduced to combine multiple sequencing datasets in order to generate complete and accurate haplotypes. The effectiveness of this integrative approach is demonstrated on a real human genome. Third, we provide a novel efficient approach to phasing pedigrees and demonstrate its advantages in comparison to phasing a single individual. Fourth, we present a generalized graph-based framework for performing haplotype-aware de novo assembly. Specifically, this generalized framework consists of a hybrid pipeline for generating accurate and complete haplotypes from data stemming from multiple sequencing technologies, one that provides accurate reads and another that provides long reads.
Genomics has opened new ways of understanding the evolution of living organisms, as well as of investigating the causes of numerous diseases and developing new therapies.
An important problem is the assembly of an individual's haplotypes. This reconstruction of haplotypes plays a central role in understanding population genetics and the evolution of a species. This thesis presents algorithms for haplotype assembly based on third-generation sequencing data. This requires large amounts of data, which in turn contain errors produced by the underlying sequencing protocols. Combinatorial formulations of the problem nevertheless make haplotype reconstruction possible, because errors can be corrected successfully. Various computational methods, such as dynamic programming, parameterized algorithms, and graph algorithms, can be used to solve this problem. This thesis presents several approaches to haplotype reconstruction. First, a novel algorithm is presented that, based on the principle of dynamic programming, provides approximation guarantees for haplotyping a single individual. Second, an integrative approach is presented for combining multiple sequencing datasets and thereby generating accurate haplotypes. The effectiveness of this method is demonstrated on a real human dataset. Third, a new, efficient algorithm is described for constructing the haplotypes of related individuals simultaneously, and its advantages over considering individuals separately are shown. Fourth, we present a graph-based method for performing de novo assembly using haplotype information. This method combines data from different sequencing technologies that provide either accurate or long sequencing reads.
On Approximability of Steiner Tree in $\ell_p$-metrics
In the Continuous Steiner Tree problem (CST), we are given as input a set of
points (called terminals) in a metric space and ask for the minimum-cost tree
connecting them. Additional points (called Steiner points) from the metric
space can be introduced as nodes in the solution. In the Discrete Steiner Tree
problem (DST), we are given in addition to the terminals, a set of facilities,
and any solution tree connecting the terminals can only contain the Steiner
points from this set of facilities. Trevisan [SICOMP'00] showed that CST and
DST are APX-hard when the input lies in the $\ell_1$-metric (and Hamming
metric). Chlebík and Chlebíková [TCS'08] showed that DST is NP-hard to
approximate to factor of $96/95-\epsilon$ in the graph metric (and
consequently $\ell_\infty$-metric). Prior to this work, it was unclear if CST
and DST are APX-hard in essentially every other popular metric! In this work,
we prove that DST is APX-hard in every $\ell_p$-metric. We also prove that CST
is APX-hard in the $\ell_{\infty}$-metric. Finally, we relate CST and DST,
showing a general reduction from CST to DST in $\ell_p$-metrics. As an
immediate consequence, this yields a $1.39$-approximation polynomial time
algorithm for CST in $\ell_p$-metrics.
Comment: Abstract shortened due to arXiv's requirement
New Computational Approaches for Multiple RNA Alignment and RNA Search
In this thesis we explore the theory and history behind RNA alignment. Normal sequence alignments as studied by computer scientists can be completed in O(n^2) time in the naive case. The process involves taking two input sequences and finding the list of edits that can transform one sequence into the other. This process is applied to biology in many forms, such as the creation of multiple alignments and the search of genomic sequences. When the RNA sequence structure is taken into account, the problem becomes even harder. Multiple RNA structure alignment is particularly challenging because covarying mutations make sequence information alone insufficient. Existing tools for multiple RNA alignments first generate pair-wise RNA structure alignments and then build the multiple alignment using only the sequence information. Here we present PMFastR, an algorithm which iteratively uses a sequence-structure alignment procedure to build a multiple RNA structure alignment. PMFastR also has low memory consumption, allowing for the alignment of large sequences such as 16S and 23S rRNA. Specifically, we reduce the memory consumption to ∼O(band^2 * m), where band is the banding size. Other solutions use ∼O(n^2 * m) memory, where n and m are the lengths of the target and query respectively. The algorithm also provides a method to utilize a multi-core environment. We present results on benchmark data sets from BRAliBase, which show that PMFastR outperforms other state-of-the-art programs. Furthermore, we regenerate 607 Rfam seed alignments and show that our automated process creates multiple alignments similar to the manually curated Rfam seed alignments. While these methods can also be applied directly to genome sequence search, the abundance of new multiple species genome alignments presents a new area for exploration. Many multiple alignments of whole genomes are available and these alignments keep growing in size.
These alignments can provide more information to the searcher than just a single sequence. Using the methodology from sequence-structure alignment we developed AlnAlign, which searches an entire genome alignment using RNA sequence structure. While programs have been readily available to align alignments, this is the first to our knowledge that is specifically designed for RNA sequences. This algorithm is presented only in theory and is yet to be tested.
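The quadratic-time sequence alignment and the banding idea mentioned in this abstract can be sketched for the plain sequence-only case. The code below is not PMFastR and ignores RNA structure entirely. It is a standard Levenshtein dynamic program, and the function name `edit_distance` and its optional `band` parameter are assumptions for illustration. Without a band it fills all (n+1) x (m+1) cells. With a band it fills only cells with |i - j| <= band, which reduces the work to roughly band * n cells.

```python
def edit_distance(a, b, band=None):
    """Levenshtein distance between a and b.

    If band is given, only cells with |i - j| <= band are filled; the
    result is then an upper bound that equals the true distance whenever
    the true distance is at most band.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    if band is not None and abs(n - m) > band:
        return inf            # the final cell lies outside the band
    # Row 0: transforming the empty prefix of a into b[:j] costs j.
    prev = [j if band is None or j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo = 0 if band is None else max(0, i - band)
        hi = m if band is None else min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i    # delete the first i characters of a
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # match or substitution
        prev = cur
    return prev[m]
```

Only two rows are kept at a time, so the memory use here is linear in m. Recovering the actual list of edits would need either the full table or a traceback scheme, which this sketch omits.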