9 research outputs found

    Meta-Alignment with Crumble and Prune: Partitioning very large alignment problems for performance and parallelization

    Background: Continuing research into the global multiple sequence alignment problem has resulted in more sophisticated and principled alignment methods. Unfortunately, these new algorithms often require large amounts of time and memory, making it nearly impossible to run them on large datasets. As a solution, we present two general methods, Crumble and Prune, for breaking a phylogenetic alignment problem into smaller, more tractable sub-problems. We call Crumble and Prune meta-alignment methods because they use existing alignment algorithms and can be used with many current alignment programs. Crumble breaks long alignment problems into shorter sub-problems. Prune divides the phylogenetic tree into a collection of smaller trees to reduce the number of sequences in each alignment problem. These methods are orthogonal: they can be applied together to provide better scaling in terms of sequence length and sequence depth. Both methods partition the problem such that many of the sub-problems can be solved independently. The results are then combined to form a solution to the full alignment problem. Results: Crumble and Prune each provide a significant performance improvement with little loss of accuracy. In some cases, a gain in accuracy was observed. Crumble and Prune were tested on real and simulated data. Furthermore, we have implemented a system called Job-tree that allows hierarchical sub-problems to be solved in parallel on a compute cluster, significantly shortening the run-time. Conclusions: These methods enabled us to solve gigabase alignment problems. They could enable a new generation of biologically realistic alignment algorithms to be applied to real-world, large-scale alignment problems.

    Integrated multiple sequence alignment

    Sammeth M. Integrated multiple sequence alignment. Bielefeld (Germany): Bielefeld University; 2005. The thesis presents enhancements for automated and manual multiple sequence alignment: existing alignment algorithms are made more easily accessible and new algorithms are designed for difficult cases. Firstly, we introduce the QAlign framework, a graphical user interface for multiple sequence alignment. It comprises several state-of-the-art algorithms and exposes their parameters through convenient dialogs. An alignment viewer with guided editing functionality can also highlight or print regions of the alignment. Phylogenetic features are also provided, e.g., distance-based tree reconstruction methods, corrections for multiple substitutions and a tree viewer. The modular concept and the platform-independent implementation make the framework easy to extend. Further, we develop a constrained version of the divide-and-conquer alignment such that it can be restricted by anchors found earlier with local alignments. It can be shown that this method shares attributes of both local and global aligners, in the quality of results as well as in the computation time. We further modify the local alignment step to work on bipartite (or even multipartite) sets for sequences where repeats overshadow valuable sequence information. In the end, a technique is established that can accurately align sequences containing possibly repeated motifs. Finally, another algorithm is presented that allows tandem repeat sequences to be compared by aligning them with respect to their possible repeat histories. We describe an evolutionary model including tandem duplications and excisions, and give an exact algorithm to compare two sequences under this model.

    Multiple sequence alignment with user-defined constraints at GOBICS

    Most multi-alignment methods are fully automated, i.e. they are based on a fixed set of mathematical rules. For various reasons, such methods may fail to produce biologically meaningful alignments. Herein, we describe a semi-automatic approach to multiple sequence alignment where biological expert knowledge can be used to influence the alignment procedure. The user can specify parts of the sequences that are biologically related to each other; our software program uses these sites as anchor points and creates a multiple alignment respecting these user-defined constraints. By using known functionally, structurally or evolutionarily related positions of the input sequences as anchor points, our method can produce alignments that reflect the true biological relationships among the input sequences more accurately than fully automated procedures can.

    Context-specific methods for sequence homology searching and alignment


    Generalization of predicates with string arguments

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2002. Thesis (Master's), Bilkent University, 2002. Includes bibliographical references (leaves 60-63). Author: Canıtezer, Göker (M.S.). String/sequence generalization is used in many different areas such as machine learning, example-based machine translation and DNA sequence alignment. In this thesis, a method is proposed to find the generalizations of predicates with string arguments from given examples. Learning from examples is a very hard problem in machine learning, since finding the globally optimal point at which to stop generalization is a difficult and time-consuming process. All work done so far employs a heuristic to find the best solution, and this work is no exception. In this study, some restrictions applied by the SLGG (Specific Least General Generalization) algorithm, which was developed for use in an example-based machine translation system, are relaxed to find all possible alignments of two strings. Moreover, a Euclidean-distance-like scoring mechanism is used to find the most specific generalizations. Some of the generated templates are eliminated by four different selection/filtering approaches to get a good solution set. Finally, the result set is presented as a decision list, which allows exceptional cases to be handled.

    Tracing the evolution of long non-coding RNAs: Principles of comparative transcriptomics for splice site conservation and biological applications

    Eukaryotic cells exhibit an extensive transcriptional diversity. Only about a quarter of the total RNA in the human cell can be accounted for by messenger RNA (mRNA), which conveys the genetic code for protein generation. The remaining part of the transcriptome consists of rather heterogeneous molecules. While some classes are well defined and have been shown to carry out distinct functions, ranging from housekeeping to complex regulatory tasks, a large fraction of the transcriptional output is categorized solely based on the lack of protein-coding capacity and transcript length. Several studies have shown that, as a group, mRNA-like long non-coding RNAs (lncRNAs) are under stabilizing selection, although at much weaker levels than mRNAs. The conservation at the level of primary sequence is even lower, blurring the contrast between exonic and intronic parts, which impedes traditional methods of genome-wide homology search. As a consequence, their evolutionary history is a fairly unexplored field and, apart from a few experimentally studied cases, the vast majority of them are reported to be poorly conserved. However, the pervasive transcription and the highly specific spatio-temporal expression patterns of lncRNAs suggest their functional importance and make their evolutionary age and conservation patterns a topic of interest. By employing diverse computational methods, recent studies have shed light on the common conservation of lncRNAs' secondary and gene structures, highlighting the significance of structural features for functionality. Splice sites, in particular, are frequently retained over very large evolutionary time scales, as they maintain the intron-exon structure of the transcript. Consequently, the conservation of splice sites can be utilized in a comparative genomics approach to establish homology and predict evolutionarily well-conserved transcripts, regardless of their coding capacity.
Since splice site conservation cannot be directly inferred from experimental evidence, in the course of this thesis a computational pipeline was established to generate comparative maps of splice sites based on multiple sequence alignments together with transcriptomics data. Scoring schemes for splice site motifs are employed to assess the conservation of orthologs. This resource can then be used to systematically study the conservation patterns of RNAs and their gene structures. This thesis demonstrates the versatility of this method by showcasing biological applications in three distinct studies. First, a comprehensive annotation of the human transcriptome, from RefSeq, ESTs and GENCODE, was used to trace the evolution of human lncRNAs. A large majority of human lncRNAs is found to be conserved across Eutheria, and many hundreds originated before the divergence of marsupials and placental mammals. However, they exhibit a rapid turnover of their transcript structures, indicating that they are actual ancient components of the vertebrate genome with outstanding evolutionary plasticity. Additionally, a public web server was set up, which allows the user to retrieve sets of orthologous splice sites from pre-computed comparative splice site maps and inspect visualizations of their conservation in the respective species. Second, a more specific data set of non-colinearly spliced latimerian RNAs is studied to fathom the origins of atypical transcripts. RNA-seq data from two coelacanth species are analyzed, yielding thousands of circular and trans-spliced products, with a surprising exclusivity of the majority of their splice junctions to atypically spliced forms; that is, they are not used in linear isoforms. The conservation analysis with comparative splice site maps yielded high conservation levels for both circularizing and trans-connecting splice sites.
This fact, in combination with their abundance, strongly suggests that atypical RNAs are evolutionarily old and of functional importance. Lastly, comparative splice site maps are used to investigate the role of lncRNAs in the evolution of Alzheimer’s disease (AD). The human specificity of AD clearly points to a phylogenetic aspect of the disease, which makes its evolutionary analysis a very promising field of research. Protein-coding and non-protein-coding regions that have been identified as differentially expressed in AD patients are analyzed for conservation of their splice sites and evolution of their exon-intron structure. Both non-coding and protein-coding AD-associated genes are shown to have evolved more rapidly in their gene structure than the genome at large. This supports the view of AD as a consequence of the recent rapid adaptive evolution of the human brain. This phylogenetic trait might have far-reaching consequences with respect to the appropriateness of animal models and the development of disease-modifying strategies.

    Progressive Multiple Alignment with Constraints

    A progressive alignment algorithm produces a multi-alignment of a set of sequences by repeatedly aligning pairs of sequences and/or previously generated alignments. We describe a method for guaranteeing that the alignment generated by a progressive alignment strategy satisfies a user-specified collection of constraints about where certain sequence positions should appear relative to others. Given a collection of C constraints over K sequences whose total length is N, our algorithm takes O(K(N^2 + KC)) time. An alignment of the β-like globin gene clusters of several mammals illustrates the practicality of the method. Key words: Multiple sequence alignment, constrained alignment, dynamic programming. 1 Introduction. It is straightforward to extend the dynamic programming alignment algorithm (Needleman and Wunsch 1970) to the simultaneous alignment of K > 2 sequences. However, the O(2^K N^K) execution time for sequences of length N makes it impractical to align more than three seque..
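
    The abstract above builds on the pairwise dynamic programming alignment of Needleman and Wunsch. As a rough illustration of that base step only, here is a minimal Python sketch of global pairwise alignment with a linear gap penalty. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example; the constraint handling and progressive strategy described in the abstract are not implemented here.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: O(n * m) time and space.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

    The table has one cell per pair of prefixes, which gives the quadratic cost for two sequences. Extending the same table to K sequences requires one dimension per sequence, which is the source of the exponential cost mentioned in the abstract.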