7 research outputs found

    Efficient chaining of seeds in ordered trees

    We consider the problem of chaining seeds in ordered trees. Seeds are mappings between two trees Q and T, and a chain is a subset of non-overlapping seeds that is consistent with respect to postfix order and ancestrality. This problem is a natural extension of a similar problem for sequences and has applications in computational biology, such as mining a database of RNA secondary structures. For the chaining problem with a set of m constant-size seeds, we describe an algorithm with complexity O(m^2 log(m)) in time and O(m^2) in space.

    Three ways to cover a graph

    We consider the problem of covering an input graph H with graphs from a fixed covering class G. The classical covering number of H with respect to G is the minimum number of graphs from G needed to cover the edges of H without covering non-edges of H. We introduce a unifying notion of three covering parameters with respect to G, two of which are novel concepts only considered in special cases before: the local and the folded covering number. Each parameter measures "how far" H is from G in a different way. Whereas the folded covering number has been investigated thoroughly for some covering classes, e.g., interval graphs and planar graphs, the local covering number has received little attention. We provide new bounds on each covering number with respect to the following covering classes: linear forests, star forests, caterpillar forests, and interval graphs. The classical graph parameters that arise this way are interval number, track number, linear arboricity, star arboricity, and caterpillar arboricity. As input graphs we consider graphs of bounded degeneracy, bounded degree, bounded tree-width or bounded simple tree-width, as well as outerplanar, planar bipartite, and planar graphs. For several pairs of an input class and a covering class we determine exactly the maximum ordinary, local, and folded covering number of an input graph with respect to that covering class. Comment: 20 pages, 4 figures.

    Bioinformatics Methods for NMR Chemical Shift Data

    Nuclear magnetic resonance spectroscopy (NMR) is one of the most important methods for determining the three-dimensional structure of biomolecules. Despite major progress in NMR methodology, solving a protein structure is still a tedious and time-consuming task. The goal of this thesis is to develop bioinformatics methods that may strongly accelerate the NMR process. This work concentrates on a special type of measurement, the so-called chemical shifts. Chemical shifts are routinely measured at the beginning of a structure resolution process. Like all laboratory data, chemical shifts may be error-prone, which might complicate or even prevent the use of this data. Therefore, as the first result of the thesis, we present CheckShift, a method which automatically corrects a frequent error in NMR chemical shift data. However, the main goal of this thesis is the extraction of structural information hidden in chemical shifts. SimShift was developed as a first step in this direction. SimShift is the first approach to identify structural similarities between proteins based on chemical shifts. Compared to methods based on the amino acid sequence alone, SimShift shows its strength in detecting distant structural relationships. As a natural further development of the pairwise comparison of proteins, the SimShift algorithm is adapted for database searching. Given a protein, the improved algorithm, named SimShiftDB, searches a database of solved proteins for structurally homologous entries. The search is based only on the amino acid sequence and the associated chemical shifts. The detected similarities are additionally ranked based on calculations of statistical significance. Finally, the Chemical Shift Pipeline, the main result of this work, is presented. By combining automatic chemical shift error correction (CheckShift) and the database search algorithm (SimShiftDB), it is possible to achieve very high quality in 70% to 80% of the similarities identified. Only about 10% of the predictions are in error.

    The Anti-Covering Location Problem: new modeling perspectives and solution approaches

    Dispersive strategies and outcomes are readily apparent in many geographic contexts. In particular, dispersive strategies can be seen in activities such as: the scattering of military missile silos and ammunition bunkers, center-pivot crop irrigation systems, location of parks, franchise store location, and territorial species den/nest locations. Spatial optimization models represent dispersion where selected facility locations are maximally "packed" or maximally "separated." The Anti-Covering Location Problem, in particular, is one in which a maximum number of facilities are located within a region such that each facility is separated by at least a minimum distance standard from all others. In this context, facilities are "dispersed" from each other through the use of the minimum separation standard. Solutions to this problem are called maximally "packed" as there exists no opportunity to add facilities without violating minimum separation standards. The Anti-Covering Location Problem (ACLP) can be defined on a continuous space domain, or more commonly, using a finite set of discrete locations. In this dissertation, it is assumed that there exists a discrete set of sites, among which a number will be selected for facility locations, and that this general problem may represent a number of different problems ranging from habitat analysis to public policy analysis. The main objective of this dissertation is to propose a new and improved optimization model for the ACLP when applied to a discrete set of points on a Cartesian plane using a combination of separation conditions called core-and-wedge constraints. This model structure, by its very definition, demonstrates that all planar problems can be defined using at most seven clique constraints for each site. In addition, the use of an added set of facet constraints in reducing computational effort is explored. 
Anti-covering location model solutions are maximally packed, providing an "optimistic" estimate of what may be possible in dispersing facilities. But what if less-than-optimal sites are employed in a dispersive pattern? That is, to what extent can an optimal maximally packed configuration be disrupted? This possibility is explored through the development of a new model, called the Disruptive Anti-Covering Location Model.

    Discovery of Unconventional Patterns for Sequence Analysis: Theory and Algorithms

    The biology community is collecting a large amount of raw data, such as the genome sequences of organisms, microarray data, and interaction data such as gene-protein and protein-protein interactions. This amount is rapidly increasing, and the process of understanding the data is lagging behind the process of acquiring it. An inevitable first step towards making sense of the data is to study their regularities, focusing on the non-random structures that appear surprisingly often in the input sequences: patterns. In this thesis we discuss three incarnations of the pattern discovery task, exploring three types of patterns that can model different regularities of the input dataset. Mask patterns have been designed to model short repeated biological sequences that show a high conservation of their content at some specific positions. Permutation patterns have been designed to detect repeated patterns whose parts maintain their physical adjacency, but not their ordering, in all the pattern occurrences. Transposons, instead, model mobile sequences in the input dataset, which can be discovered by comparing different copies of the same input string and detecting large insertions and deletions in their alignment.