Sparse Dynamic Programming on DAGs with Small Width
The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG-extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time O(k|E| log |V|). This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence.
The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG (V, E) in O(k|E| log |V|) time, improving all known time-bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs.
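For orientation, the classical single-sequence case that the abstract above generalizes can be sketched as follows. This is only a minimal sketch of the textbook O(n log n) longest increasing subsequence computation on one sequence (the special case where the DAG is a path), not the DAG algorithm of the paper; the function name lis_length is an illustrative assumption.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence of seq.

    Hypothetical helper for illustration: the sparse-DP idea is to keep,
    for every achievable length, the smallest value that can end an
    increasing subsequence of that length, and to update it by binary search.
    """
    tails = []  # tails[i] = smallest tail of an increasing subsequence of length i + 1
    for x in seq:
        i = bisect_left(tails, x)  # first position whose tail is >= x
        if i == len(tails):
            tails.append(x)  # x extends the longest subsequence found so far
        else:
            tails[i] = x  # x is a smaller tail for length i + 1
    return len(tails)
```

Each element costs one binary search, which gives the O(n log n) bound for a single sequence; according to the abstract, the DAG framework is slower than such sequence algorithms only by a factor k.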
Using Minimum Path Cover to Boost Dynamic Programming on DAGs : Co-linear Chaining Extended
An exact mathematical programming approach to multiple RNA sequence-structure alignment
One of the main tasks in computational biology is the computation of
alignments of genomic sequences to reveal their commonalities. In case of DNA
or protein sequences, sequence information alone is usually sufficient to
compute reliable alignments. RNA molecules, however, build spatial
conformations—the secondary structure—that are more conserved than the actual
sequence. Hence, computing reliable alignments of RNA molecules has to take
into account the secondary structure. We present a novel framework for the
computation of exact multiple sequence-structure alignments: We give a graph-
theoretic representation of the sequence-structure alignment problem and
phrase it as an integer linear program. We identify a class of constraints
that make the problem easier to solve and relax the original integer linear
program in a Lagrangian manner. Experiments on a recently published benchmark
show that our algorithm performs comparably to more costly dynamic
programming algorithms and outperforms all other approaches in terms of
solution quality with an increasing number of input sequences.
Clustering of scientific fields by integrating text mining and bibliometrics.
The growing dissemination of scientific and technological publications via the internet, and their availability in large-scale bibliographic databases, create enormous opportunities for mapping science and technology. The continuing increase in available computing power and the development of new algorithms also contribute to this. Important challenges remain, however. This dissertation confirms the hypothesis that the accuracy of both clustering scientific fields and classifying publications can be further improved by integrating text mining and bibliometrics. Both the textual and the bibliometric approach have advantages and disadvantages, and each offers a different view of a corpus of scientific publications or patents. On the one hand, such documents contain a wealth of textual information; on the other hand, the citations among them form large networks that provide additional information. We integrate both viewpoints and show how existing textual and bibliometric methods can be improved. The dissertation consists of three parts. First, we discuss the use of text mining techniques for information retrieval and for mapping the knowledge contained in texts. We introduce and demonstrate the text mining framework, as well as the use of agglomerative hierarchical clustering. We further investigate the relationship between clustering performance on the one hand and the desired number of clusters and the number of factors in latent semantic indexing on the other. In addition, we describe a composite, semi-automatic strategy for determining the number of clusters in a collection of documents. Second, we address networks consisting of citations between scientific documents and networks arising from collaborations between authors.
Such networks can be analyzed with techniques from bibliometrics and graph theory, with the aim of ranking relevant entities, clustering, and discovering communities. Third, we demonstrate the complementarity of text mining and bibliometrics and propose ways to integrate both worlds correctly. The performance of unsupervised clustering and of classification improves significantly when the textual content of scientific publications is combined with the structure of citation networks. A method based on statistical meta-analysis achieves the best results and outperforms methods based on text or citations alone. Our integrated, or hybrid, strategies for information retrieval and clustering are demonstrated in two domain studies. The aim of the first study is to unravel and visualize the concept structure of information science and to test the added value of the hybrid method. The second study covers the cognitive structure, bibliometric properties, and dynamics of bioinformatics. We develop a method for dynamic and integrated clustering of evolving bibliographic corpora. This method compares and tracks clusters over time. In summary, for the complementary worlds of text and networks we design a hybrid clustering method that takes both paradigms into account simultaneously. We also show that the integrated view yields a better understanding of the structure and evolution of scientific fields.
Novel Algorithms and Methodology to Help Unravel Secrets that Next Generation Sequencing Data Can Tell
The genome of an organism is its complete set of DNA nucleotides, spanning
all of its genes and also of its non-coding regions. It contains most of
the information necessary to build and maintain an organism. It is therefore
no surprise that sequencing the genome provides an invaluable tool for
the scientific study of an organism. Via the inference of an evolutionary
(phylogenetic) tree, DNA sequences can be used to reconstruct the evolutionary
history of a set of species. DNA sequences, or genotype data, have
also proven useful for predicting an organism's phenotype (i.e., observed
traits) from its genotype. This is the objective of association studies.
While methods for finding the DNA sequence of an organism have existed
for decades, the recent advent of Next Generation Sequencing (NGS) has
meant that the availability of such data has increased to such an extent
that the computational challenges that now form an integral part of biological
studies can no longer be ignored. By focusing on phylogenetics
and Genome-Wide Association Studies (GWAS), this thesis aims to help
address some of these challenges. As a consequence, this thesis is in two
parts with the first one centring on phylogenetics and the second one on
GWAS.
In the first part, we present theoretical insights for reconstructing phylogenetic
trees from incomplete distances. This problem is important in the
context of NGS data as incomplete pairwise distances between organisms
occur frequently with such input and ignoring taxa for which information
is missing can introduce undesirable bias. In the second part we focus on
the problem of inferring population stratification between individuals in a
dataset due to reproductive isolation. While powerful methods for doing
this have been proposed in the literature, they tend to struggle when faced
with the sheer volume of data that comes with NGS. To help address this
problem we introduce the novel PSIKO software and show that it scales
very well when dealing with large NGS datasets.
The anaerobic linalool metabolism in the betaproteobacteria Castellaniella defragrans 65Phen and Thauera linaloolentis 47Lol
The betaproteobacteria Castellaniella defragrans 65Phen and Thauera linaloolentis 47Lol were recently isolated on monoterpenes as sole carbon and energy source under denitrifying conditions. C. defragrans 65Phen metabolizes the hydrocarbon monoterpene beta-myrcene. Its activation is catalyzed by the bifunctional enzyme linalool dehydratase/isomerase. In the presented work, an improved purification protocol was developed to yield high amounts of protein for structural analysis by X-ray crystallography. The structure of the enzyme and a proposed mechanism are described. T. linaloolentis 47Lol uses the tertiary monoterpene alcohol linalool as sole carbon source. It is isomerized into the primary alcohol geraniol by the enzyme linalool isomerase. The presented work describes an enrichment of the enzyme from protein extracts and its characterization. Further degradation of geraniol via the acyclic terpene utilization pathway was shown by cultivation-based and enzymatic experiments.
Computation of Sensitive Multiple Spaced Seeds
Similarity search is one of the most important problems in bioinformatics, with applications in read mapping, homology search, oligonucleotide design, etc. Similarity search is time and memory intensive, hence heuristic methods using multiple spaced seeds are commonly employed. A spaced seed is a string of 1 and *, where 1 represents a match position and * represents a don't-care position. Seeds are used to discover regions with identity; thus, it is imperative to design seeds of high sensitivity, so as to maximize the number of hits.
We present SpEED2, a software program to generate multiple spaced seeds of high sensitivity. It uses a novel seed optimization approach and outperforms all the leading programs used for designing multiple spaced seeds, such as Iedera, AcoSeeD, and rasbhari. Our algorithm will benefit the many software tools that depend on good-quality seeds for their operation, such as PatternHunter for similarity search, SHRiMP and BFAST for read mapping, bestPrimer for designing primers, and many more.
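The definition of a spaced seed given above can be illustrated with a small sketch. It only shows how a single seed registers hits on two equal-length, ungapped sequences; it is not the SpEED2 optimization algorithm, and the function name seed_hits and the example strings are assumptions made for illustration.

```python
def seed_hits(seed, s, t):
    """Return the offsets at which a spaced seed hits the ungapped pair (s, t).

    Hypothetical helper for illustration: the seed hits at offset i when
    s and t agree at every '1' position of the seed placed at i; positions
    marked '*' are don't-care positions and are not compared.
    """
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    care = [j for j, c in enumerate(seed) if c == "1"]  # match positions only
    last_offset = len(s) - len(seed)
    return [
        i
        for i in range(last_offset + 1)
        if all(s[i + j] == t[i + j] for j in care)
    ]
```

For instance, with the seed 1*1 the sequences ACGTA and ATGCA produce hits at offsets 0 and 2, whereas the contiguous seed 111 produces none, because every window of length three contains a mismatch.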