280 research outputs found

    Multiple Biological Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

    Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences' structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structured scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm for use in our three new multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in a progressive fashion. The use of a dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequent stages. The other two algorithms utilize sequence knowledge bases and sequence consistency to produce biologically meaningful sequence alignments. To improve the speed of multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm.

    On the role of metaheuristic optimization in bioinformatics

    Metaheuristic algorithms are employed to solve complex and large-scale optimization problems in many different fields, from transportation and smart cities to finance. This paper discusses how metaheuristic algorithms are being applied to solve different optimization problems in the area of bioinformatics. While the text provides references to many optimization problems in the area, it focuses on those that have attracted more interest from the optimization community. Among the problems analyzed, the paper discusses in more detail molecular docking, protein structure prediction, phylogenetic inference, and different string problems. In addition, references to other relevant optimization problems are also given, including those related to medical imaging or gene selection for classification. From this analysis, the paper generates insights on research opportunities for the Operations Research and Computer Science communities in the field of bioinformatics.

    Time Warp Edit Distance with Stiffness Adjustment for Time Series Matching

    In a way similar to the string-to-string correction problem, we address time series similarity in the light of a time-series-to-time-series correction problem, for which the similarity between two time series is measured as the minimum-cost sequence of "edit operations" needed to transform one time series into another. To define the "edit operations" we use the paradigm of a graphical editing process and end up with a dynamic programming algorithm that we call Time Warp Edit Distance (TWED). TWED is slightly different in form from the Dynamic Time Warping, Longest Common Subsequence, and Edit Distance with Real Penalty algorithms. In particular, it highlights a parameter which drives a kind of stiffness of the elastic measure along the time axis. We show that the similarity provided by TWED is a metric potentially useful in time series retrieval applications, since it could benefit from the triangle inequality property to speed up the retrieval process while tuning the parameters of the elastic measure. In that context, a lower bound is derived to relate the matching of time series in down-sampled representation spaces to the matching in the original space. The empirical quality of the TWED distance is evaluated on a simple classification task. Compared to Edit Distance, Dynamic Time Warping, Longest Common Subsequence, and Edit Distance with Real Penalty, TWED has proven to be quite effective on the considered experimental task.
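
The dynamic programming recurrence behind TWED can be sketched as follows. This is a minimal illustration based on the commonly published form of the recurrence, not code from the paper: the parameter names `nu` (the stiffness along the time axis) and `lam` (a constant penalty per delete operation), their default values, and the absolute-difference sample cost are assumptions, since the abstract does not spell them out.

```python
import math


def twed(a, ta, b, tb, nu=0.001, lam=1.0):
    """Time Warp Edit Distance between series a (timestamps ta) and b (timestamps tb).

    nu  : stiffness, weights differences between timestamps (assumed name).
    lam : constant penalty added for every delete operation (assumed name).
    """
    if len(a) != len(ta) or len(b) != len(tb):
        raise ValueError("each series needs exactly one timestamp per sample")

    # Pad both series with a zero sample at time zero, because the
    # recurrence always looks one sample back.
    a, ta = [0.0] + list(a), [0.0] + list(ta)
    b, tb = [0.0] + list(b), [0.0] + list(tb)
    n, m = len(a), len(b)

    # d[i][j] = cost of matching the first i samples of a with the first j of b.
    d = [[math.inf] * m for _ in range(n)]
    d[0][0] = 0.0

    for i in range(1, n):
        for j in range(1, m):
            # Delete a sample in a.
            delete_a = (d[i - 1][j] + abs(a[i] - a[i - 1])
                        + nu * (ta[i] - ta[i - 1]) + lam)
            # Delete a sample in b.
            delete_b = (d[i][j - 1] + abs(b[j] - b[j - 1])
                        + nu * (tb[j] - tb[j - 1]) + lam)
            # Match the current samples and the previous samples.
            match = (d[i - 1][j - 1]
                     + abs(a[i] - b[j]) + abs(a[i - 1] - b[j - 1])
                     + nu * (abs(ta[i] - tb[j]) + abs(ta[i - 1] - tb[j - 1])))
            d[i][j] = min(delete_a, delete_b, match)

    return d[n - 1][m - 1]
```

With `nu = 0` the timestamps no longer contribute to the cost, while larger values of `nu` penalize matches between samples that are far apart in time, which is the stiffness behavior the abstract refers to.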

    A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

    Spaced seeds have recently been shown not only to detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm these two results by independent experiments, and propose in this article to use a coverage criterion (Benson and Mak, 2008, Martin, 2013, Martin and Noé, 2014) to measure the seed efficiency in both cases in order to design better seed patterns. We first show how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other frequently used criteria, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. Finally, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed. Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
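
To make the three criteria concrete, here is a small brute-force sketch. It assumes a common encoding that the abstract itself does not state: the alignment is a string of `1` (match) and `0` (mismatch), the seed is a string of `1` (must match) and `0` (don't care), and coverage is read as the number of alignment positions covered by a must-match position of at least one hit. The function names are hypothetical, and the scan is a direct loop rather than the automaton-based approach of the article.

```python
def seed_hits(alignment, seed):
    """Start positions where every '1' of the seed lies on a match ('1') of the alignment."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    last_start = len(alignment) - len(seed)
    return [p for p in range(last_start + 1)
            if all(alignment[p + i] == "1" for i in care)]


def coverage(alignment, seed):
    """Number of alignment positions covered by a must-match seed position of some hit."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for p in seed_hits(alignment, seed):
        covered.update(p + i for i in care)
    return len(covered)


alignment = "1101111"
spaced, contiguous = "1101", "111"   # both seeds have three must-match positions

single_hit = bool(seed_hits(alignment, spaced))   # single-hit criterion: any hit at all
multiple_hit = len(seed_hits(alignment, spaced))  # multiple-hit criterion: number of hits
```

On this toy alignment the spaced seed `1101` hits at positions 0 and 3 and covers five positions, whereas the contiguous seed `111` hits at positions 3 and 4 but covers only four, because its two hits overlap heavily.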

    Sequence self-selection by the network dynamics of random ligating oligomer pools

    One problem in research on the Origin of Life is the step from short oligomers with a random sequence of bases to longer, active complexes like ribozymes. Chemistry and non-equilibrium physics have found pathways for the formation of short strands, mediated by activated nucleotides or activation agents. It has also been shown that medium-length, autocatalytically active complexes can form reaction networks necessary for downstream reactions and the emergence of more sophisticated systems. The intermediate step is difficult, though: short random sequences need to be extended while introducing a reduction of sequence entropy that does not inhibit double-stranded complex formation. This pre-selection step is necessary due to the vast sequence space of oligomer strands and must conform to an Origin-of-Life-like environment. The ancient reactions depended only on the inherent parameters of the oligomer strands, such as the hybridization to double-stranded complexes, and on basic physical mechanisms such as temperature changes. Although short strands likely emerged by a non-templated, polymerization-like extension mechanism, passing sequence information from one strand to another is assumed to be essential in longer strands. Chemical ligation is probable in the context of the Origin of Life, but its yield and specificity are known to be low. As a model for this mechanism, this study utilizes an evolved DNA ligase to facilitate the ligation reaction of the model oligomer (DNA) without introducing a sequence bias. After 1000 temperature cycles, necessary for the dissociation and rehybridization of elongated strands, the resulting reaction products showed several distinct features of selection. Compared to random strands of the same length, the set of emerging reaction products has a significantly reduced sequence entropy. The emerging strands can be categorized into two groups: A-type and T-type strands, with a ratio of about 70:30 % of the respective base. 
As strands can only act as a template or substrate in the templated ligation reaction when they are in the single-stranded conformation, this selection inhibits the formation of self-folding oligomer products. At the same time, the formation of double-stranded complexes is not reduced, as the A-type and T-type groups are predominantly sets of reverse complement sequences. Selecting only the most common subsequences from each group as a new starting pool shows a greater fitness of the selected pool for the emergence of new ligation products compared to a pool of random strands. While the analysis of the complex types and the exact dynamics is limited in the experimental system, all parameters can be accessed and analyzed in detail in a closely related theoretical simulation. This simulation found a distinct relation between the major reaction rates in the system: the hybridization on-rate, the ligation rate, and the hybridization off-rate. With those parameters, the emergence of a local minimum-maximum feature in the concentration-over-strand-length analysis could be identified and the feature's appearance predicted in the experiment. The understanding of the dynamical regimes of the reaction also led to the understanding of length-dependent dominant growth modes of the product strands. Analyzing the emerging 24mers with a simple simulation based on these growth modes produced the same common ligation site sequence pattern, ATAT. A small bias towards the self-complementary AT sequence motif at the 3'-end of strands in the original random-sequence 12mer pool was identified as the likely cause. This study found that longer product strands in this simple model system had a reduced entropy while retaining the important hybridization properties of the oligomer strands. 
The dynamics of the elongation depend on the microscopic rates of complex formation, ligation, and dissociation, while the temperature cycling frequency imposes a rate limit for the entire system and substantially changes the reaction dynamics.
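
As a rough illustration of what a reduced sequence entropy means for a 70:30 base composition, the following sketch computes the Shannon entropy of the base composition of a strand. This is an assumption for illustration only: the abstract does not state which entropy measure the study used, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per base) of the base composition of seq."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = shannon_entropy("ACGT")            # all four bases equally frequent
a_type = shannon_entropy("A" * 7 + "T" * 3)  # 70 % A and 30 % T, as in an A-type strand
```

Under this measure a uniformly random four-letter strand carries 2 bits per base, while a strand with 70 % A and 30 % T carries about 0.88 bits per base.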

    Ab initio identification of putative human transcription factor binding sites by comparative genomics

    We discuss a simple and powerful approach for the ab initio identification of cis-regulatory motifs involved in transcriptional regulation. The method we present integrates several elements: human-mouse comparison, statistical analysis of genomic sequences, and the concept of coregulation. We apply it to a complete scan of the human genome. By using the catalogue of conserved upstream sequences collected in the CORG database, we construct sets of genes sharing the same overrepresented motif (short DNA sequence) in their upstream regions both in human and in mouse. We perform this construction for all possible motifs from 5 to 8 nucleotides in length and then filter the resulting sets, looking for two types of evidence of coregulation: first, we analyze the Gene Ontology annotation of the genes in the set, searching for statistically significant common annotations; second, we analyze the expression profiles of the genes in the set as measured by microarray experiments, searching for evidence of coexpression. The sets which pass one or both filters are conjectured to contain a significant fraction of coregulated genes, and the upstream motifs characterizing the sets are thus good candidates to be the binding sites of the TFs involved in such regulation. In this way we find various known motifs and also some new candidate binding sites. Comment: 22 pages, 2 figures. Supplementary material available from the author.

    RNA structure analysis : algorithms and applications

    In this doctoral thesis, efficient algorithms for aligning RNA secondary structures and mining unknown RNA motifs are presented. As the major contribution, a structure alignment algorithm that combines both primary and secondary structure information can find the optimal alignment between two given structures, where one of them is either a pattern structure of a known motif or a real query structure and the other is a subject structure. Motivated by widely used algorithms for RNA folding, the proposed algorithm decomposes an RNA secondary structure into a set of atomic structural components that can be further organized in a tree model to capture the structural particularities. The novel structure alignment algorithm is implemented using dynamic programming techniques coupled with position-independent scoring matrices. The algorithm can find the optimal global and local alignments between two RNA secondary structures in quadratic time. When applied to searching a structure database, the algorithm can find similar RNA substructures and can therefore be used to identify functional RNA motifs. The algorithm has also been extended to handle position-dependent scoring matrices for the purpose of aligning multiple structures. All algorithms have been implemented in a package named RSmatch and applied to searching an mRNA UTR structure database and mining RNA motifs. The experimental results showed high efficiency and effectiveness of the proposed techniques.

    Algorithms for RNA secondary structure analysis : prediction of pseudoknots and the consensus shapes approach

    Reeder J. Algorithms for RNA secondary structure analysis : prediction of pseudoknots and the consensus shapes approach. Bielefeld (Germany): Bielefeld University; 2007. Our understanding of the role of RNA has undergone a major change in the last decade. RNA was once believed to be a mere carrier of information and a structural component of the ribosomal machinery; with the advent of the genomic age, it has become clear that RNAs play a much more active role. RNAs can act as regulators and can have catalytic activity, roles previously attributed only to proteins. There is still much speculation in the scientific community as to what extent RNAs are responsible for the complexity in higher organisms, which can hardly be explained with only proteins as regulators. In order to investigate the roles of RNA, it is therefore necessary to search for new classes of RNA. For those and for already known classes, analyses of their presence in different species of the tree of life will provide further insight into the evolution of biomolecules and especially RNAs. Since RNA function often follows its structure, computer programs for RNA structure prediction are an integral part of this procedure. The secondary structure of RNA, the level of base pairing, strongly determines the tertiary structure. As the latter is computationally intractable and experimentally expensive to obtain, secondary structure analysis has become an accepted substitute. In this thesis, I present two new algorithms (and a few variations thereof) for the prediction of RNA secondary structures. The first algorithm addresses the problem of predicting a secondary structure, including RNA pseudoknots, from a single sequence. Pseudoknots have been shown to be functionally relevant in many RNA-mediated processes. However, pseudoknots are excluded from consideration by state-of-the-art RNA folding programs for reasons of computational complexity. 
While folding a sequence of length n into unknotted structures requires O(n^3) time and O(n^2) space, finding the best structure including arbitrary pseudoknots has been proven to be NP-complete. Nevertheless, I demonstrate in this work that certain types of pseudoknots can be included in the folding process with only a moderate increase in computational cost. In analogy to protein-coding RNA, where a conserved encoded protein hints at a similar metabolic function, structural conservation in RNA may give clues to RNA function and help in finding RNA genes. However, structure conservation is more complex to deal with computationally than sequence conservation. The method considered to be, at least conceptually, the ideal approach in this situation is the Sankoff algorithm. It simultaneously aligns two sequences and predicts a common secondary structure. Unfortunately, it is computationally rather expensive: O(n^6) time and O(n^4) space for two sequences, and for more than two sequences it becomes exponential in the number of sequences. Therefore, several heuristic implementations have emerged in the last decade that try to make the Sankoff approach practical by introducing pragmatic restrictions on the search space. In this thesis, I propose to redefine the consensus structure prediction problem in a way that does not imply a multiple sequence alignment step. For a family of RNA sequences, my method explicitly and independently enumerates the near-optimal abstract shape space and predicts an abstract shape as the consensus for all sequences. For each sequence, it delivers the thermodynamically best structure which has this shape. The technique of abstract shapes analysis is employed here for a synoptic view of the suboptimal folding space. As the shape space is much smaller than the structure space, and identification of common shapes can be done in linear time (in the number of shapes considered), the method is essentially linear in the number of sequences. 
Evaluations show that the new method compares favorably with available alternatives.
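
The O(n^3) time and O(n^2) space bound for folding into unknotted structures can be illustrated with a deliberately simplified stand-in: Nussinov-style base-pair maximization. This is not the thermodynamic energy minimization that the thesis and state-of-the-art folding programs rely on; it only shows the shape of the dynamic program. The set of allowed pairs, the `min_loop` parameter, and the function name are assumptions made for this sketch.

```python
# Watson-Crick pairs plus the G-U wobble pair (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested (pseudoknot-free) base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    The table has O(n^2) cells and each cell scans O(n) split points.
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j], inclusive.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some base k, splitting the interval
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


hairpin_pairs = max_base_pairs("GGGAAACCC")  # three G-C pairs closing an AAA loop
```

Because every pair splits the interval into an inside part and a left part that are solved independently, this recurrence can only produce nested pairs. Crossing pairs, that is, pseudoknots, fall outside it, which is why they need the separate treatment described in the abstract.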